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Abstract 

Bayesian field theory denotes a nonparametric Bayesian approach 
for learning functions from observational data. Based on the principles 
of Bayesian statistics, a particular Bayesian field theory is defined by 
combining two models: a likelihood model, providing a probabilistic 
description of the measurement process, and a prior model, providing 
the information necessary to generalize from training to non-training 
data. The particular likelihood models discussed in the paper are 
those of general density estimation, Gaussian regression, clustering, 
classification, and models specific for inverse quantum problems. Be- 
sides problem typical hard constraints, like normalization and non- 
negativity for probabilities, prior models have to implement all the 
specific, and often vague, a priori knowledge available for a specific 
task. Nonparametric prior models discussed in the paper are Gaussian 
processes, mixtures of Gaussian processes, and non-quadratic poten- 
tials. Prior models are made flexible by including hyperparameters. 
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In particular, the adaption of mean functions and covariance oper- 
ators of Gaussian process components is discussed in detail. Even 
if constructed using Gaussian process building blocks, Bayesian field 
theories are typically non-Gaussian and have thus to be solved nu- 
merically. According to increasing computational resources the class 
of non-Gaussian Bayesian field theories of practical interest which 
are numerically feasible is steadily growing. Models which turn out 
to be computationally too demanding can serve as starting point to 
construct easier to solve parametric approaches, using for example 
variational techniques. 
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1 Introduction 



The last decade has seen a rapidly growing interest in learning from observa- 
tional data. Increasing computational resources enabled successful applica- 
tions of empirical learning algorithms in various areas including, for example, 
time series prediction, image reconstruction, speech recognition, computer to- 
mography, and inverse scattering and inverse spectral problems for quantum 
mechanical systems. Empirical learning, i.e., the problem of finding underly- 
ing general laws from observations, represents a typical inverse problem and 
is usually ill-posed in the sense of Hadamard |15], |l|, [219], fj35|, [DJ, |21|. It 



is well known that a successful solution of such problems requires additional 
a priori information. It is a priori information which controls the general- 
ization ability of a learning system by providing the link between available 
empirical "training" data and unknown outcome in future "test" situations. 

We will focus mainly on nonparametric approaches, formulated directly 
in terms of the function values of interest. Parametric methods, on the other 
hand, impose typically implicit restrictions which are often extremely diffi- 
cult to relate to available a priori knowledge. Combined with a Bayesian 
framework [0, |l|, |33], |14|, [T97|, [m], 0, ||, |07|, ^ |3(], |42|, |T05|, pjoj, a 



nonparametric approach allows a very flexible and interpretable implemen- 
tation of a priori information in form of stochastic processes. Nonparamet- 
ric Bayesian methods can easily be adapted to different learning situations 
and have therefore been applied to a variety of empirical learning problems, 
including regression, classification, density estimation and inverse quantum 
problems | 167 , 232 , 14^ , 141 , 137 , [2171| . Technically, they are related to kernel 
and regularization methods which often appear in the form of a roughness 
penalty approach |ZT!| [2T|, [L87|, |206|, |150|, |22|, 0, || 



116, 221]. Compu- 



tationally, working with stochastic processes, or discretized versions thereof, 
is more demanding than, for example, fitting a small number of parame- 
ters. This holds especially for such applications where one cannot take full 
advantage of the convenient analytical features of Gaussian processes. Nev- 
ertheless, it seems to be the right time to study nonparametric Bayesian 
approaches also for non-Gaussian problems as they become computationally 
feasible now at least for low dimensional systems and, even if not directly 
solvable, they provide a well defined basis for further approximations. 
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In this paper we will in particular study general density estimation prob- 
lems. Those include, as special cases, regression, classification, and certain 
types of clustering. In density estimation the functions of interest are the 
probability densities p(y\x,h), of producing output ("data") y under con- 
dition x and unknown state of Nature h. Considered as function of h, for 
fixed y, x, the function p(y\x, h) is also known as likelihood function and a 
Bayesian approach to density estimation is based on a probabilistic model 
for likelihoods p(y\x, h). We will concentrate on situations where y and x are 
real variables, possibly multi-dimensional. In a nonparametric approach, the 
variable h represents the whole likelihood function p(y\x, h). That means, h 
may be seen as the collection of the numbers < p(y\x, h) < 1 for all x and 
all y. The dimension of h is thus infinite, if the number of values which the 
variables x and/or y can take is infinite. This is the case for real x and/or y. 

A learning problem with discrete y variable is also called a classifica- 
tion problem. Restricting to Gaussian probabilities p(y\x, h) with fixed vari- 
ance leads to (Gaussian) regression problems. For regression problems the 
aim is to find an optimal regression function h(x). Similarly, adapting a 
mixture of Gaussians allows soft clustering of data points. Furthermore, 
extracting relevant features from the predictive density p(y\x, data) is the 
Bayesian analogue of unsupervised learning. Other special density estimation 
problems are, for example, inverse problems in quantum mechanics where h 
represents a unknown potential to be determined from observational data 
142, 141, |137] , 217| . Special emphasis will be put on the explicit and flexible 



implementation of a priori information using, for example, mixtures of Gaus- 
sian prior processes with adaptive, non-zero mean functions for the mixture 
components. 

Let us now shortly explain what is meant by the term "Bayesian Field 
Theory": From a physicists point of view functions, like h(x,y) = p(y\x, h), 
depending on continuous variables x and/or y, are often called a 'field'.Q 
Most times in this paper we will, as common in field theories in physics, not 
parameterize these fields and formulate the relevant probability densities or 
stochastic processes, like the prior p{h) or the posterior p(h\f), directly in 
terms of the field values h(x, y), e.g., p{h\f) = p(h(x, y),x G X, y G Y\f). (In 
the parametric case, discussed in Chapter 0, we obtain a probability density 



We may also remark that for example statistical field theories, which encompass quan- 
tum mechanics and quantum held theory in th eir Euclidean formulation, are technically 



similar to a nonparametric Bayesian approach [244, 103, |l2 
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p(h\f) = p(£\f) for fields h(x,y,£) parameterized by £.) 

The possibility to solve Gaussian integrals analytically makes Gaussian 
processes, or (generalized) free fields in the language of physicists, very at- 
tractive for nonparametric learning. Unfortunately, only the case of Gaussian 
regression is completely Gaussian. For general density estimation problems 
the likelihood terms are non-Gaussian, and even for Gaussian priors addi- 
tional non-Gaussian restrictions have to be included to ensure non-negativity 
and normalization of densities. Hence, in the general case, density estimation 
corresponds to a non-Gaussian, i.e., interacting field theory. 

As it is well known from physics, a continuum limit for non-Gaussian the- 
ories, based on the definition of a renormalization procedure, can be highly 
nontrivial to construct. (See |^U|, [5| for an renormalization approach to den- 
sity estimation.) We will in the following not discuss such renormalization 
procedures but focus more on practical, numerical learning algorithms, ob- 
tained by discretizing the problem (typically, but not necessarily in coordi- 
nate space). This is similar, for example, to what is done in lattice field 
theories. 

Gaussian problems live effectively in a space with dimension not larger 
than the number of training data. This is not the case for non-Gaussian 
problems. Hence, numerical implementations of learning algorithms for non- 
Gaussian problems require to discretize the functions of interest. This can 
be computationally challenging. 

For low dimensional problems, however, many non-Gaussian models are 
nowadays solvable on a standard PC. Examples include predictions of one- 
dimensional time series or the reconstruction of two-dimensional images. 
Higher dimensional problems require additional approximations, like projec- 
tions into lower dimensional subspaces or other variational approaches. In- 
deed, it seems that a most solvable high dimensional problems live effectively 
in some low dimensional subspace. 

There are special situations in classification where non-negativity and 
normalization constraints are fulfilled automatically. In that case, the cal- 
culations can still be performed in a space of dimension not larger than the 
number of training data. Contrasting Gaussian models, however the equa- 
tions to be solved are then typically nonlinear. 

Summarizing, we will call a nonparametric Bayesian model to learn a 
function one or more continuous variables a Bayesian field theory, having 
especially in mind non-Gaussian models. A large variety of Bayesian field 
theories can be constructed by combining a specific likelihood models with 
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specific functional priors (see Tab. |T|). The resulting flexibility of nonpara- 
metric Bayesian approaches is probably their main advantage. 



likelihood model 


prior model 


describes 


measurement process (Chap. |2|) 


generalization behavior (Chap. 


is determined by 


parameters (Chap. [3], £|) 


hyperparameters (Chap. |5|) 


Examples include 


density estimation(Sects. |3?l|-3.6, |6.2|) 
regression (Sects. |3TT|, |Q|) 
classification (Sect. [3~8l) 
inverse quantum theory (Sect. |3.9|) 


hard constraints (Chap. |2|) 
Gaussian prior factors (Chap. |3]) 
mixtures of Gauss. (Sects. |67[H6.4|) 
non-quadratic potentials(Sect. p.5|) 


Learning algorithms are treated in Chapter |7]. 



Table 1: A Bayesian approach is based on the combination of two models, 
a likelihood model, describing the measurement process used to obtain the 
training data, and a prior model, enabling generalization to non-training 
data. Parameters of the prior model are commonly called hyperparameters. 
In "nonparametric" approaches the collection of all values of the likelihood 
function itself are considered as the parameters. A nonparametric Bayesian 
approach for likelihoods depending on one or more real variables is in this 
paper called a Bayesian field theory. 
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The paper is organized as follows: Chapter ^| summarizes the Bayesian 
framework as needed for the subsequent chapters. Basic notations are de- 
fined, an introduction to Bayesian decision theory is given, and the role of 
a priori information is discussed together with the basics of a Maximum 
A Posteriori Approximation (MAP), and the specific constraints for density 
estimation problems. Gaussian prior processes, being the most commonly 
used prior processes in nonparametric statistics, are treated in Chapter |3|. 
In combination with Gaussian prior models, this section also introduces the 
likelihood models of density estimation, (Sections [3J], |3^2| , |373|) Gaussian re- 
gression and clustering (Section |3.7| ), classification (Section |3.8| ), and inverse 
quantum problems (Section |3.9|) . Notice, however, that all these likelihood 
models can also be combined with the more elaborated prior models dis- 
cussed in the following sections of the paper. Parametric approaches, useful 
if a numerical solution of a full nonparametric approach is not feasible, are 
the topic of Chapter |j. Hyperparameters, parameterizing prior processes 
and making them more flexible, are considered in Section |5|. Two possibil- 
ities to go beyond Gaussian processes, mixture models and non-quadratic 
potentials, are presented in Section ||. Chapter [7| focuses on learning algo- 
rithms, i.e., on methods to solve the stationarity equations resulting from 
a Maximum A Posteriori Approximation. In this section one can also find 
numerical solutions of Bayesian field theoretical models for general density 
estimation. 



2 Bayesian framework 

2.1 Basic model and notations 

2.1.1 Independent, dependent, and hidden variables 

Constructing theories means introducing concepts which are not directly ob- 
servable. They should, however, explain empirical findings and thus have to 
be related to observations. Hence, it is useful and common to distinguish 
observable (visible) from non-observable (hidden) variables. Furthermore, 
it is often convenient to separate visible variables into dependent variables, 
representing results of such measurements the theory is aiming to explain, 
and independent variables, specifying the kind of measurements performed 
and not being subject of the theory. 

Hence, we will consider the following three groups of variables 
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1. observable (visible) independent variables x, 



2. observable (visible) dependent variables y, 

3. not directly observable (hidden, latent) variables h. 

This characterization of variables translates to the following factorization 
property, defining the model we will study, 

p{x, y, h) = p(y\x, h) p(x) p(h). (1) 

In particular, we will be interested in scenarios where ) and 

analogously y = (yi,y<2, ■ • •) are decomposed into independent components, 
meaning that p(y\x,h) = YliP(yi\ x i, h) and p(x) = YiiP( x i) factorize. Then, 

p(x, y, h) = Ylp(yi\xi, h) p(xi) p{h). (2) 

i 

Fig.|I] shows a graphical representation of the factorization model @) as a 
directed acyclic graph ||182| , |125| , |107| , |196|1 . The Xi and/or yi itself can also 



be vectors. 

The interpretation will be as follows: Variables h G H represent possible 
states of (the model of) Nature, being the invisible conditions for dependent 
variables y. The set H defines the space of all possible states of Nature for 
the model under study. We assume that states h are not directly observable 
and all information about p(h) comes from observed variables (data) y, x. 
A given set of observed data results in a state of knowledge f numerically 
represented by the posterior density p(h\f) over states of Nature. 

Independent variables x G X describe the visible conditions (measure- 
ment situation, measurement device) under which dependent variables (mea- 
surement results) y have been observed (measured). According to Eq. ([I]) 
they are independent of h, i.e., p(x\h) = p(x). The conditional density 
p(y\x, h) of the dependent variables y is also known as likelihood of h (under y 
given x). Vector-valued y can be treated as a collection of one-dimensional y 
with the vector index being part of the x variable, i.e., y a (x) = y(x, a) = y(x) 
with x = (x, a). 

In the setting of empirical learning available knowledge is usually sep- 
arated into a finite number of training data D = {(xi,yi)\l < i < n} 
={{xd,Vd) and, to make the problem well defined, additional a priori in- 
formation D Q . For data D U D we write p(h\f) = p(h\D, D ). Hypotheses h 
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Figure 1: Directed acyclic graph for the factorization model ([!]). 



represent in this setting functions h(x,y) = p(y\x, h) of two (possibly multi- 
dimensional) variables y, x. In density estimation y is a continuous variable 
(the variable x may be constant and thus be skipped), while in classification 
problems y takes only discrete values. In regression problems on assumes 
p(y\x, h) to be Gaussian with fixed variance, so the function of interest be- 
comes the regression function h(x) = J dyyp(y\x,h). 

2.1.2 Energies, free energies, and errors 

Often it is more convenient to work with log-probabilities L = \np than with 
probabilities. Firstly, this ensures non-negativity of probabilities p = e L > 
for arbitrary L. (For p = the log-probability becomes L = — oo.) Thus, 
when working with log-probabilities one can skip the non-negativity con- 
straint which would be necessary when working with probabilities. Secondly, 
the multiplication of probabilities for independent events, yielding their joint 
probability, becomes a sum when written in terms of L. Indeed, from p(A, B) 
= p(A and B) = p(A)p(B) it follows for L(A, B) = \nP(A, B) that L(A, B) 
=L(A and B) = L(A)L(B). Especially in the limit where an infinite number 
of events is combined by and, this would result in an infinite product for p 
but yields an integral for L, which is typically easier to treat. 

Besides the requirement of being non-negative, probabilities have to be 
normalized, e.g., Jdxp(x) = 1. When dealing with a large set of elementary 
events normalization is numerically a nontrivial task. It is then convenient 
to work as far as possible with unnormalized probabilities Z(x) from which 
normalized probabilities are obtained asp(x) = Z(x)/Z with partition sum Z 
— J2 X Z(x). Like for probabilities, it is also often advantageous to work with 
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the logarithm of unnormalized probabilities, or to get positive numbers (for 
p(x) < 1) with the negative logarithm E(x) = — (1/0) lnZ(x), in physics also 
known as energy. (For the role of (3 see below.) Similarly, F = —(1/0) InZ 
is known as free energy. 

Defining the energy we have introduced a parameter Varying the 
parameter f3 generates an exponential family of densities which is frequently 
used in practice by (simulated or deterministic) annealing techniques for 
minimizing free energies |114|, [153], [195], [43], |, [T9|, |3|, [68|, p39|, piOfl. In 



physics P is known as inverse temperature and plays the role of a Lagrange 
multiplier in the maximum entropy approach to statistical physics. Inverse 
temperature j3 can also be seen as an external field coupling to the energy. 
Indeed, the free energy F is a generating function for the cumulants of the 
energy, meaning that cumulants of E can be obtained by taking derivatives 
of F with respect to f3 |T3L |160| . For a detailled discussion of the 

relations between probability, log-probability, energy, free energy, partition 



sums, generating functions, and also bit numbers and information see ||132 
The posterior p(h\f), for example, can so be written as 

p(h\f) = e^l/) = « = 

P[ U) Z(H\f) Z(H\f) 

= e -l3(E(h\f)-F(H\f)) _ e -/3E(h\f)+c(H\f) ^ ^ 

with (posterior) log-probability 

L(h\f) = \np(h\f), (4) 
unnormalized (posterior) probabilities or partition sums 

Z(h\f), Z(H\f)= JdhZ(h\f), (5) 

(posterior) energy 



E(h\f) = ~]uZ(h\f) (6) 



and (posterior) free energy 



F(H\f) = ~\nZ(H\f) (7) 
= -l]nfdhe-WW, (8) 
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yielding 



Z(h\f) = e -W/), (9) 
Z(H\f) = Jdhe-WW, ( 10 ) 

where / dh represent a (functional) integral, for example over variables (func- 
tions) h(x,y) — p(y\x,h), and 

c(H\f) = -\nZ(H\f)=PF(H\f). (11) 

Note that we did not include the /3-dependency of the functions Z, F, c in 
the notation. 

For the sake of clarity, we have chosen to use the common notation for 
conditional probabilities also for energies and the other quantities derived 
from them. The same conventions will also be used for other probabilities, 
so we will write for example for likelihoods 

p(y\x,h)=e-W E M*> h )- F <r\*> h )), (12) 

for y G Y . Inverse temperatures may be different for prior and likelihood. 
Thus, we may choose /?' ^ (5 in Eq. fll2|) and Eq. ([|). 

In Section |2l| we will discuss the maximum a posteriori approximation 
where an optimal h is found by maximizing the posterior p(h\f). Since 
maximizing the posterior means minimizing the posterior energy E(h\f) the 
latter plays the role of an error functional for h to be minimized. This is 
technically similar to the minimization of an regularized error functional as 
it appears in regularization theory or empirical risk minimization, and which 
is discussed in Section |2~5| . 

Let us have a closer look to the integral over model states h. The variables 
h represent the parameters describing the data generating probabilities or 
likelihoods p(y\x, h). In this paper we will mainly be interested in "nonpara- 
metric" approaches where the (x, y, /^-dependent numbers p(y\x, h) itself are 
considered to be the primary degrees of freedom which "parameterize" the 
model states h. Then, the integral over h is an integral over a set of real vari- 
ables indexed by x, y, under additional non-negativity and normalization 
condition. 

Hdp(y\x,h)) . (13) 




13 



Mathematical difficulties arise for the case of continuous x, y where p(h\f) 
represents a stochastic process, and the integral over h becomes a functional 
integral over (non-negative and normalized) functions p(y\x,h). For Gaus- 
sian processes such a continuum limit can be defined [51], [77], |223| , |143| , |l49 



while the construction of continuum limits for non-Gaussian processes is 
highly non-trivial (See for instance f5| |3?] |I0J, [23§ |T5|, ^28], ^ 0, [92 



for perturbative approaches or JF_7| f° r a non-perturbative -theory.) In 
this paper we will take the numerical point of view where all functions are 
considered to be finally discretized, so the /i-integral is well-defined ( "lattice 
regularization" |4], [200], |T60fl ). 

2.1.3 Posterior and likelihood 

Bayesian approaches require the calculation of posterior densities. Model 
states h are commonly specified by giving the data generating probabilities 
or likelihoods p{y\x, h). Posteriors are linked to likelihoods by Bayes' theorem 

p(m = mm*, ( u) 

which follows at once from the definition of conditional probabilities, i.e., 
p(A,B) = p(A\B)p(B) = p(B\A)p(A). Thus, one finds 

mnn\ P( D \ h ) P( h \ D o) p(y D \x D , h) p(h\D ) 

p{h\f)=p{h\D,D ) = — — = — j — 15 

p{D\D ) p{y D \x D ,D ) 

= UiP(xi,yi\h)p(h\D ) = UiP(yi\xi,h)p(h\D ) 

JdhUiP(xi,yi\h)p(h\D ) /d/i riiP(yi|^i,%(^|-Do)' 

using p(yo\xD, D , h) = p(yr>\xD, h) for the training data likelihood of h and 
p(h\Do, Xi) = p(h\Do). The terms of Eq. ( |i~5|) are in a Bayesian context often 
referred to as 

likelihood x prior 

posterior = . (17) 

evidence 



Eqs.(16) show that the posterior can be expressed equivalently by the joint 
likelihoods p(yi,Xi\h) or conditional likelihoods p(yi\xi,h). When working 
with joint likelihoods, a distinction between y and x variables is not neces- 
sary. In that case x can be included in y and skipped from the notation. 
If, however, p(x) is already known or is not of interest working with condi- 
tional likelihoods is preferable. Eqs. (p~5| , [T6D can be interpreted as updating 
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(or learning) formula used to obtain a new posterior from a given prior prob- 
ability if new data D arrive. 

In terms of energies Eq. (|16D reads, 



e -PY. i E{ 3H \x i ,H)-pE{H\D ) Z(Y D \ XD ,h)Z(H\D ) 



Z(Y D \x D ,h)Z{H\D ) J e -/3Ei E (^Kfc)-WIA))' 

where the same temperature 1/(3 has been chosen for both energies and the 
normalization constants are 

Z(Y D \x D ,h) = Y[[d yi e-^^ h \ (19) 

i 

Z(H\D ) = Jdhe- f}E W Do \ (20) 

The predictive density we are interested in can be written as the ratio of 
two correlation functions under po(h), 

p(y\xj) = <p(y\x,h)> H \f (21) 

(22) 



< p(y\x, h) UiP(yi\ 


Xi, h) >H\D 


< UiP{yi\ 


Xi, h) >H\D 



fdhp(y\x,h)e ^° mb 
fdh e~^ £c ° mb 

where < • • • >h\d denotes the expectation under the prior density po(h) 
= p(h\D ) and the combined likelihood and prior energy E com ^ collects the 
/i-dependent energy and free energy terms 

E comh = J2 E iVi\ x ii h) + E(h\D ) - F(Y D \x D , h), (24) 

i 

with 

F{Y D \x D ,h) = ~\nZ(Y D \x D ,h). (25) 

Going from Eq. (p2|) to Eq. (^) the normalization factor Z(H\Dq) appearing 
in numerator and denominator has been canceled. 

We remark that for continuous x and/or y the likelihood energy term 
E(yi\xi,h) describes an ideal, non-realistic measurement because realistic 
measurements cannot be arbitrarily sharp. Considering the function p(-\-, h) 
as element of a Hilbert space its values may be written as scalar product 
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p{x\y, h) = (v xy , p(-\-, h) ) with a function v xy being also an element in that 
Hilbert space. For continuous x and/or y this notation is only formal as v xy 
becomes unnormalizable. In practice a measurement of p(-\-, h) corresponds 
to a normalizable v xy = jdy fdx$(x,y)v xy where the kernel fl^x^y) has to 
ensure normalizability. (Choosing normalizable v xy as coordinates the Hilbert 



space of p(-\-,h) is also called a reproducing kernel Hilbert space ||180| , |112| , 
113| , [223| , |143|| .) The data terms then become 

ptyAxi, h) = f d yf dx Mx,y)p(y,x\h) (26) 

Jdy&i(x,y)p(y,x\h) 

The notation p(yi\xi, h) is understood as limit $(x,y) — > 8(x — Xi)5(y — yi) 
and means in practice that $(x,y) is very sharply centered. We will assume 
that the discretization, finally necessary to do numerical calculations, will 
implement such an averaging. 

2.1.4 Predictive density 

Within a Bayesian approach predictions about (e.g., future) events are based 
on the predictive probability density, being the expectation of probability for 
y for given (test) situation x, training data D and prior data Dq 



P(y\x, f) = p(y\x } D, D ) = J dhp(h\f) p(y\x, h) = < p(y\x, h) > H \ f . (27) 

Here < ••• >H\f denotes the expectation under the posterior p(h\f) = 
p(h\D,Do), the state of knowledge / depending on prior and training data. 
Successful applications of Bayesian approaches rely strongly on an adequate 
choice of the model space H and model likelihoods p(y\x, h). 

Note that p(y\x, f) is in the convex cone spanned by the possible states 
of Nature h G H, and typically not equal to one of these p(y\x, h). The situ- 
ation is illustrated in Fig. During learning the predictive density p(y\x, f) 
tends to approach the true p(y\x, h). Because the training data are random 
variables, this approach is stochastic. (There exists an extensive literature 
analyzing the stochastic properties of learning and generalization from a sta- 
tistical mechanics perspective |62|, |63|, |64|, |226| , [234| , |175|| ). 



2.1.5 Mutual information and learning 

The aim of learning is to generalize the information obtained from training 
data to non-training situations. For such a generalization to be possible, 
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Figure 2: The predictive density p(y\x, f) for a state of knowledge / = 
f(D, D ) is in the convex hull spanned by the possible states of Nature hi 
characterized by the likelihoods p(y\x, hi). During learning the actual pre- 
dictive density p(y\x, f) tends to move stochastically towards the extremal 
point p(y\x, /itrue) representing the "true" state of Nature. 

there must exist a, at least partially known, relation between the likelihoods 
p(yi\xi,h) for training and for non-training data. This relation is typically 
provided by a priori knowledge. 

One possibility to quantify the relation between two random variables 
yi and y 2 , representing for example training and non-training data, is to 
calculate their mutual information, defined as 

M(Y 1 ,Y 2 ) — £ P(yi,y 2 )m (28) 
yi eY^ y2 eY 2 P(yi)p(y2) 

It is also instructive to express the mutual information in terms of (average) 
information content or entropy, which, for a probability function p(y), is 
defined as 

H(Y) = -\nJ2p(y)^p(y)- (29) 

y& 

We find 

M(Y 1 , Y 2 ) = H(Y 1 ) + H{Y 2 ) - H(Y 1 , Y 2 ), (30) 

meaning that the mutual information is the sum of the two individual en- 
tropies diminished by the entropy common to both variables. 
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To have a compact notation for a family of predictive densities p(jji\xi, f) 
we choose a vector x = (xi,x 2 , • • •) consisting of all possible values Xi and 
corresponding vector y = (1/1,2/2, • • •)) so we can wr ite 

p(y\x, f) = p(yi, 2/2, •■ ■ \x 1} x 2 , ■ ■ ■ , /). (31) 

We now would like to characterize a state of knowledge / corresponding to 
predictive density p(y\x,f) by its mutual information. Thus, we generalize 
the definition fl2"H| ) from two random variables to a random vector y with 
components j/j, given vector x with components Xi and obtain the conditional 
mutual information 

*pt./) - / (n »)^/)%g ff < 32 > 



or 



M(Y\x,f)=(Jdy i H(Y i \xJ)-H(Y\x,f)y (33) 
in terms of conditional entropies 

#(Y|x, /) = - / /) Mylar, /)■ (34) 

In case not a fixed vector x is given, like for example x = (xi,x%, ■ • •), but a 
density p(x), it is useful to average the conditional mutual information and 
conditional entropy by including the integral Jdxp(x) in the above formulae. 
It is clear from Eq. (32) that predictive densities which factorize 

P(y\x,f) = ]Jp(yi\xi,f), (35) 

i 

have a mutual information of zero. Hence, such factorial states do not allow 
any generalization from training to non-training data. A special example are 
the possible states of Nature or pure states h, which factorize according to 
the definition of our model 

P(y\x, h) = Y[p(yi\xi, h). (36) 

i 

Thus, pure states do not allow any further generalization. This is consistent 
with the fact that pure states represent the natural endpoints of any learning 
process. 
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It is interesting to see, however, that there are also other states for which 
the predictive density factorizes. Indeed, from Eq. ([36]) it follows that any 
(prior or posterior) probability p(h) which factorizes leads to a factorial state, 



This means generalization, i.e., (non-local) learning, is impossible when 
starting from a factorial prior. 

A factorial prior provides a very clear reference for analyzing the role 
of a-priori information in learning. In particular, with respect to a prior 
factorial in local variables x i: learning may be decomposed into two steps, 
one increasing, the other lowering mutual information: 

1. Starting from a factorial prior, new non-local data D (typically called 
a priori information) produce a new non-factorial state with non-zero 
mutual information. 

2. Local data D (typically called training data) stochastically reduce the 
mutual information. Hence, learning with local data corresponds to a 
stochastic decay of mutual information. 

Pure states, i.e., the extremal points in the space of possible predictive 
densities, do not have to be deterministic. Improving measurement devices, 
stochastic pure states may be further decomposed into finer components g, 
so that 



Imposing a non-factorial prior p(g) on the new, finer hypotheses g enables 
again non-local learning with local data, leading asymptotically to one of 
the new pure states p(yi\xi, g). 

Let us exemplify the stochastic decay of mutual information by a simple 
numerical example. Because the mutual information requires the integration 
over all t/i variables we choose a problem with only two of them, y a and 
yb corresponding to two x values x a and Xb- We consider a model with 
four states of Nature hi, 1 < I < 4, with Gaussian likelihood p(y\x,h) = 
(v^ra) -1 exp (— (y — hi(x)) 2 / (2a 2 )) and local means hi(xj) = ±1. 

Selecting a "true" state of Nature h, we sample 50 data points Di = 
(xi, yi) from the corresponding Gaussian likelihood using p(x a ) = p(x&) = 



p(h) = Y[p(h(xi)) p(y\x, f) = Y[p(yi\xi, f). 



(37) 




(38) 
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0.5. Then, starting from a given, factorial or non-factorial, prior p(h\D ) we 
sequentially update the predictive density, 

4 

p(y\x, /(A+i, • - • , A))) = J2p(y\ x i h i) p( h i\ A+i, ■ • • , A), (39) 



by calculating the posterior 

p(hi\D i+1 ,---,D ) = ■ — ^ — . (40) 

A • - •, Do) 

It is easily seen from Eq. (^) that factorial states remain factorial under 
local data. 

Fig. ^| shows that indeed the mutual information decays rapidly. Depend- 
ing on the training data, still the wrong hypothesis hi may survive the decay 
of mutual information. Having arrived at a factorial state, further learning 
has to be local. That means, data points for Xi can then only influence the 
predictive density for the corresponding yi and do not allow generalization 
to the other yj with j ^ i. 

For a factorial prior p(hi) = p(hi(x a ))p(hi(xb)) learning is thus local from 
the very beginning. Only very small numerical random fluctuations of the 
mutual information occur, quickly eliminated by learning. Thus, the predic- 
tive density moves through a sequence of factorial states. 



2.2 Bayesian decision theory 
2.2.1 Loss and risk 

In Bayesian decision theory a set A of possible actions a is considered, to- 
gether with a function l(x, y, a) describing the loss I suffered in situation x if 
y appears and action a is selected |16|, |127| , |182| , |197|| . The loss averaged over 
test data x, y, and possible states of Nature h is known as expected risk, 



r(a, f) 



dx dyp(x) p(y\x, f) l(x, y, a). 

< K x iVi a ) >X,Y\f 

< r(a, h) > H \f 



(41) 

(42) 
(43) 



where < • • • >x,Y\f denotes the expectation under the joint predictive density 
P(x,y\f) = p(x)p(y\xj) and 



r(a, h) = j dx dyp(x) p(y\x, h) l(x, y, a) 



(44) 
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Figure 3: The decay of mutual information during learning: Model with 4 
possible states hi representing Gaussian likelihoods p(yi\xi, hi) with means ±1 
for two different Xi values. Shown are posterior probabilities p(hi\f) (a, c, e, 
on the left hand side, the posterior of the true hi is shown by a thick line) and 
mutual information M(y) (b, d, /, on the right hand side) during learning 
50 training data, (a, b): The mutual information decays during learning 
and becomes quickly practically zero, (c, d): For "unlucky" training data 
the wrong hypothesis hi can dominate at the beginning. Nevertheless, the 
mutual information decays and the correct hypothesis has finally to be found 
through "local" learning, (e, /): Starting with a factorial prior the mutual 
information is and remains zero, up to artificial numerical fluctuations. For 
(e, /) the same random data have been used as for (c, d). 
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The aim is to find an optimal action a* 

a* = argmin aej4 r(a, /). 



(45) 



2.2.2 Loss functions for approximation 

Log— loss: A typical loss function for density estimation problems is the log- 
loss 

l(x, y, a) = -foi(x) \np(y\x, a) + b 2 (x, y) (46) 

with some a-independent &i(x) > 0, b 2 (x,y) and actions a describing proba- 
bility densities 

J dyp(y\x, a) = 1, Vx G X, Va G A. (47) 
Choosing b 2 (x,y) = p(y\x, f) and b\(x) = 1 gives 

r(a,/) = / dxdyp(x)p(y\x, /) In ' (48) 

" <hl p(^M > w (49) 

= < KL(p(y|x,/), p(j/|z,o)) > x , (50) 

which shows that minimizing log-loss is equivalent to minimizing the {x— 
averaged) Kullback-Leibler entropy KL(p(y|x, /), p(y\x, a) ) [|122j , |123| , p~3| , [46 . 



While the paper will concentrate on log-loss we will also give a short 
summary of loss functions for regression problems. (See for example |TB|, |197| 
for details.) Regression problems are special density estimation problems 
where the considered possible actions are restricted to y-independent func- 
tions a{x). 

Squared— error loss: The most common loss function for regression prob- 



lems (see Sections |3/7], |3.7.2|) is the squared-error loss. It reads for one- 
dimensional y 

l(x, y, a) = 6i(x) (y - a(x)f + b 2 (x, y), (51) 

with arbitrary b\{x) > and b 2 (x,y). In that case the optimal function a(x) 
is the regression function of the posterior which is the mean of the predictive 
density 

a*(x) = J dyyp(y\x, f) = <y >y\ x J • (52) 
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This can be easily seen by writing 

(y - a(x)f = (y - <y >y\xJ + < V >Y\x,f ~ a {%)f (53) 
= (y-<y>Y\x,f) + (a(x) - < y > Y \x,f) 

-2{y-<y > Y \xj) (a(x) -<y > Y \x,}f , (54) 

where the first term in ([54]) is independent of a and the last term vanishes 
after integration over y according to the definition of < y >Y\xj- Hence, 

r(a, /) = J dx bi(x)p(x) (a(x) — < y >y\x,f) + const. (55) 

This is minimized by a(x) =< y >Y\xj- Notice that for Gaussian p(y\x, a) 
with fixed variance log-loss and squared-error loss are equivalent. For multi- 
dimensional y one-dimensional loss functions like Eq. (|5"T| ) can be used when 
the component index of y is considered part of the x-variables. Alternatively, 
loss functions depending explicitly on multidimensional y can be defined. For 
instance, a general quadratic loss function would be 

l(x,y,a) = Y^iVk - a k )K(k,k')(y k > - a k '(x)). (56) 

k,k> 

with symmetric, positive definite kernel K(/c, k'). 
Absolute loss: For absolute loss 

l(x,y,a) = h(x)\y - a(x)\ + b 2 (x,y), (57) 
with arbitrary bi(x) > and b 2 (x,y). The risk becomes 

r r a ( x ) 
r(a,f) = dxb 1 (x)p(x) dy (a(x) - y)p(y\x, f) 



J — oo 

dxb\[x)p{x) \ dy (y — a(x)) p(y\x, f) + const. (58) 

J a(x) 

f f a ( x ) 

= 2 dxbi(x)p(x) / dy (a(x) — y) p(y\x, f) + const.', (59) 

J Jm(x) 

where the integrals have been rewritten as = J!!^ + S^l and = 
lalx)^ + Jm(x) introducing a median function m(x) which satisfies 

/m(x) ] 
dyp{y\x,f) = -,VxeX, (60) 
-oo / 
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so that 



(pm{x) roo \ 

/ dyp(y\x, /) - / dyp(y\x, /) J = 0, Vx G X (61) 
J—oo Jm(x) J 

Thus the risk is minimized by any median junction m(x). 

5— loss and 0—1 loss : Another possible loss function, typical for classifica- 
tion tasks (see Section |3.8| ), like for example image segmentation [ 150 1 , is the 
5-loss for continuous y or 0-1-loss for discrete y 

l(x, y, a) = -bi(x)6 (y - a(x)) + b 2 (x, y), (62) 

with arbitrary bi(x) > and b2{x,y). Here 5 denotes the Dirac ^-functional 
for continuous y and the Kronecker 5 for discrete y. Then, 

r(a, /) = Jdx b± (x)p(x) p( y = a(x) \x, f) + const., (63) 

so the optimal a corresponds to any mode function of the predictive density. 
For Gaussians mode and median are unique, and coincide with the mean. 

2.2.3 General loss functions and unsupervised learning 

Choosing actions a in specific situations often requires the use of specific 
loss functions. Such loss functions may for example contain additional terms 
measuring costs of choosing action a not related to approximation of the 
predictive density. Such costs can quantify aspects like the simplicity, imple- 
mentability, production costs, sparsity, or understandability of action a. 

Furthermore, instead of approximating a whole density it often suffices 
to extract some of its features, like identifying clusters of similar y-values, 
finding independent components for multidimensional y, or mapping to an 
approximating density with lower dimensional x. This kind of exploratory 
data analysis is the Bayesian analogue to unsupervised learning methods. 
Such methods are on one hand often utilized as a preprocessing step but 
are, on the other hand, also important to choose actions for situations where 
specific loss functions can be defined. 

From a Bayesian point of view general loss functions require in general 



an explicit two-step procedure [|131|| : 1. Calculate (an approximation of) the 
predictive density, and 2. Minimize the expectation of the loss function under 
that (approximated) predictive density. (Empirical risk minimization, on the 
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other hand, minimizes the empirical average of the (possibly regularized) loss 
function, see Section p75| .) (For a related example see for instance [|138|| .) 

For a Bayesian version of cluster analysis, for example, partitioning a pre- 
dictive density obtained from empirical data into several clusters, a possible 
loss function is 

l(x,y,a) = (y - a(x,y)) 2 , (64) 

with action a(x,y) being a mapping of y for given x to a finite number of 
cluster centers (prototypes). Another example of a clustering method based 
on the predictive density is Fukunaga's valley seeking procedure . 



For multidimensional of actions a(P x x, y) can be chosen de- 

pending only on a (possibly adaptable) lower dimensional projection of x. 

For multidimensional y with components yi it is often useful to identify 
independent components. One may look, say, for a linear mapping y = 
ISAy minimizing the correlations between different components of the 'source' 
variables y by minimizing the loss function 

l(y,y',M) = ^y ] , (65) 

with respect to M under the joint predictive density for y and y' given 
x, x', D, D . This includes a Bayesian version of blind source separation (e.g. 
applied to the so called cocktail party problem [14, |7|]), analogous to the 
treatment of Molgedey and Schuster ||159| . Interesting projections of mul- 
tidimensional y can for example be found by projection pursuit techniques 
E9L HTM TO GDBII. 



2.3 Maximum A Posteriori Approximation 



In most applications the (usually very high or even formally infinite dimen- 
sional) /i-integral over model states in Eq. (^) cannot be performed exactly. 
The two most common methods used to calculate the h integral approxi- 
mately are Monte Carlo integration [[[51], [9l|, |J5|, fL9|, [[§ [70J [195|, fZIJ [TTJ, 
p33| , |69| , p. 6 71 , |177] , |198| , |168|| and saddle point approximation []16 , [45l [30| , |169| , 
P~7| , ^44| , |197| , 59[ ^(j, |131| . The latter approach will be studied in the following. 

For that purpose, we expand E comb of Eq. (^4|) with respect to h around 
some h* 

E rnmh (h) = e W) E(h)\ (66) 



h=h* 



E comh (h*) + (Ah,V(h*)) + -(A/i,H(/i*)A/i) + ■ 
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with Ah = (h — h*), gradient V (not acting on Ah), Hessian H, and round 
brackets (•••,•••) denoting scalar products. In case p(y\x, h) is parameterized 
independently for every x, y the states h represent a parameter set indexed 
by x and y, hence 



V(h* 



8E comh (h) 



Sh(x,y) 



SE comh (p(y'\x',h)) 



H(/i* 



5 2 E comh (h) 



5h(x, y)5h(x', y') 



h=h* 



h=h* 



Sp(y\x, h) 



h=h* 



5 2 E comh (p(y"\x",h)) 



5p(y\x, h)5p(y'\x', h) 



(67) 



(68) 



h=h* 



are functional derivatives 
x, y) and for example 



|106| . |29| , p6| (or partial derivatives for discrete 



(Ah, V(/i*)) = dxdy (h(x, y) - h*{x, y)) V(h*)(x, y). 



(69) 

Choosing h* to be the location of a local minimum of E comp (h) the linear 
term in ( |66"]) vanishes. The second order term includes the Hessian and 
corresponds to a Gaussian integral over h which could be solved analytically 



Jdhe-^ Ah > nA V = 7r*/H(detH)-i 



(70) 



for a (i-dimensional /i-integral. However, using the same approximation for 
the /i-integrals in numerator and denominator of Eq. (p3]), expanding then 
also p(y\x, h) around h*, and restricting to the first (/i-independent) term 
p(y\x, h*) of that expansion, the factor ( |7D[ ) cancels, even for infinite d. (The 
result is the zero order term of an expansion of the predictive density in 
powers of 1//3. Higher order contributions can be calculated by using Wick's 
theorem g5|, [30[ [169], |4|, [109], [160], |13|.) The final approximative result for 
the predictive density fl2~7|) is very simple and intuitive 



p{y\xj) ~p(y\x, h*), 



(71) 



with 



h*= aigmm h&H E comb = argmax heH p{h\f) = aigmax h&H p(y D \x D , h)p(h\D ). 

(72) 

The saddle point (or Laplace) approximation is therefore also called Max- 
imum A Posteriori Approximation (MAP). Notice that the same h* also 
maximizes the integrand of the evidence of the data yr> 



p(yD\x D ,D ) = J dhp(y D \x D ,h)p(h\D ). 



(73) 
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This is due to the assumption that p(y\x, h) is slowly varying at the stationary 
point and has not to be included in the saddle point approximation for the 
predictive density For (functional) differentiable E com f, Eq. ([72]) yields the 
stationarity equation, 

= (74) 

8h(x,y) 

The functional -E C omb including training and prior data (regularization, sta- 
bilizer) terms is also known as (regularized) error functional for h. 

In practice a saddle point approximation may be expected useful if the 
posterior is peaked enough around a single maximum, or more general, if the 
posterior is well approximated by a Gaussian centered at the maximum. For 
asymptotical results one would have to require 

~ pYl L (.yi\ x i> h )> ( 75 ) 

to become /3-independent for (3 — > oo with some (3 being the same for the 
prior and data term. (See ||(], |237| |). If for example - J2i L(yi\xi, h) converges 



for large number n of training data the low temperature limit — ^ can 
be interpreted as large data limit n — > oo, 



nE comh = n l—y2L(yi\xi,h) + -E(h\D )\ . 



(76) 



Notice, however, the factor 1/n in front of the prior energy. For Gaussian 
p(y\x, h) temperature 1/(3 corresponds to variance a 2 



^£comb = ^2(5 HiVi ~ Hx,)) 2 + a 2 E(h\D )^j . (77) 



a 2 



For Gaussian prior this would require simultaneous scaling of data and prior 
variance. 

We should also remark that for continuous x,y the stationary solution h* 
needs not to be a typical representative of the process p(h\f). A common 
example is a Gaussian stochastic process p(h\f) with prior energy E(h\D ) 
related to some smoothness measure of h expressed by derivatives of p(y\x, h). 
Then, even if the stationary h* is smooth, this needs not to be the case for 
a typical h sampled according to p(h\f). For Brownian motion, for instance, 
a typical sample path is not even differentiable (but continuous) while the 
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stationary path is smooth. Thus, for continuous variables only expressions 
like Jdhe~ /3E ^ can be given an exact meaning as a Gaussian measure, de- 
fined by a given covariance with existing normalization factor, but not the 
expressions dh and E(h) alone 0, |5|, |223|, [IK], [83|, [143]. 



Interestingly, the stationary h* yielding maximal posterior p(h\f) is not 
only useful to obtain an approximation for the predictive density p(y\x, f) 
but is also the optimal solution a* for a Bayesian decision problem with 
log-loss and a G A = H . 



Indeed, for a Bayesian decision problem with log-loss §£b\ ) 

argmin a6 ^r(a, h) = h, (78) 

and analogously, 

argmin aeF r(a,/) = /. (79) 

This is proved as follows: Jensen's inequality states that 

dyp(y)g(g(y)) > g( / dyp(y)q(y)), (80) 



for any convex function g and probability p(y) > with jdyp(y) = 1. Thus, 
because the logarithm is concave 

- / dyp(y\x, h) In > - In / dyp(y\x, h)^^- = (81) 

J p{y\x,h) J p{y\x,h) 



dyp(y\x, h) \np(y\x } a) > - J dyp(y\x, h) \np(y\x, h), (82) 
with equality for a = h. Hence 

r(a,h) = — J dx Jdy p(x)p(y | x, h) (bi (x) In p(y\x, a) + 62(2, y)) (83) 

= — j ' dx p(x)bi(x) Jdyp(y\x, h) \n.p{y\x, a) + const. (84) 

> — J dx p(x)bi(x) Jdy p(y\x, h) \n.p{y\x, h) + const. (85) 

= r(h,h), (86) 

with equality for a = h. For a 6 F replace h G H by / £ F. This proves 
Eqs. (M) and 
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2.4 Normalization, non— negativity, and specific priors 

Density estimation problems are characterized by their normalization and 
non-negativity condition for p(y\x,h). Thus, the prior density p(h\Do) can 
only be non-zero for such h for which p(y\x,h) is positive and normalized 
over y for all x. (Similarly, when solving for a distribution function, i.e., the 
integral of a density, the non-negativity constraint is replaced by monotonic- 
ity and the normalization constraint by requiring the distribution function 
to be 1 on the right boundary.) While the non-negativity constraint is local 
with respect to x and y, the normalization constraint is nonlocal with respect 
to y. Thus, implementing a normalization constraint leads to nonlocal and 
in general non-Gaussian priors. 

For classification problems, having discrete y values (classes), the nor- 
malization constraint requires simply to sum over the different classes and 
a Gaussian prior structure with respect to the x-dependency is not altered 
[231 1 . For general density estimation problems, however, i.e., for continu- 



ous y, the loss of the Gaussian structure with respect to y is more severe, 
because non-Gaussian functional integrals can in general not be performed 
analytically. On the other hand, solving the learning problem numerically 
by discretizing the y and x variables, the normalization term is typically not 
a severe complication. 

To be specific, consider a Maximum A Posteriori Approximation, mini- 
mizing 

(3E comh = -J2 L (V*K h) + PE(h\D Q ), (87) 

i 

where the likelihood free energy F(Yd\xd, h) is included, but not the prior 
free energy F(H\D ) which, being /i-independent, is irrelevant for minimiza- 
tion with respect to h. The prior energy /3E(h\D Q ) has to implement the 
non-negativity and normalization conditions 

Z x (x, h) = J dy i p{y i \x il h) = 1, Vx; G X i: V/i G H (88) 
p(Vi\xi, h) > 0, Vy< G Yi, Vx, G X t , VA G H. (89) 

It is useful to isolate the normalization condition and non-negativity con- 
straint defining the class of density estimation problems from the rest of the 
problem specific priors. Introducing the specific prior information Dq so that 
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D = {D , normalized, positive}, we have 

~ p(nonn.,pos.|fe)p(fe|.Do) 

p(/i|A), norm., pos.) = r~— , (90) 

p(norm., pos. \D ) 

with deterministic, _Do~ m dependent 

p(norm., pos. \h) — p(norm., pos.|/t, D ) (91) 
= p(norm.|%(pos» = 5(Z X - l)Y[e(p(y\x,hj), (92) 

xy 

and step function 0. ( The density p(novm.\h) is normalized over all pos- 
sible normalizations of p(y\x,h), i.e., over all possible values of Zx, and 
p(pos.\h) over all possible sign combinations.) The /i-independent denomi- 
nator p(norm., pos. \D ) can be skipped for error minimization with respect 
to h. We define the specific prior as 

p(h\D ) oc e - EWD °\ (93) 

In Eq. ( p3|) the specific prior appears as posterior of a /i-generating pro- 
cess determined by the parameters Do. We will call therefore Eq. (|93|) the 
posterior form of the specific prior. Alternatively, a specific prior can also be 
in likelihood form 

p(Do, /i | norm., pos.) = p(D \h) p(/i|norm., pos.). (94) 

As the likelihood p(D \h) is conditioned on h this means that the normal- 
ization Z = JdDoe~ E ( D °^ remains in general /i-dependent and must be in- 
cluded when minimizing with respect to h. However, Gaussian specific priors 
with /i-independent covariances have the special property that according to 
Eq. ( |T0"D likelihood and posterior interpretation coincide. Indeed, represent- 
ing Gaussian specific prior data D by a mean function £g and covariance 
K (analogous to standard training data in the case of Gaussian regression, 
see also Section |3.5| ) one finds due to the fact that the normalization of a 
Gaussian is independent of the mean (for uniform (meta) prior p(h)) 

e -3(*-*fi .K(h-t fio )) 

mD ° ] = J<tte -W(»-v> < 95 > 

-i(fc-tfi .K(A-t flo )) 
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Thus, for Gaussian p{t^ \ h, K) with /i-independent normalization the specific 
prior energy in likelihood form becomes analogous to Eq. fl93|) 

pfeJft.K)^-^ 1 ^' (97) 

and specific prior energies can be interpreted both ways. 
Similarly, the complete likelihood factorizes 

p(D , norm., pos.\h) = p(norm.,pos. \h)p(D \h). (98) 

According to Eq. (^2j) non-negativity and normalization conditions are 
implemented by step and <5-functions. The non-negativity constraint is only 
active when there are locations with p(y\x, h) = 0. In all other cases the gra- 
dient has no component pointing into forbidden regions. Due to the combined 
effect of data, where p(y\x, h) has to be larger than zero by definition, and 
smoothness terms the non-negativity condition for p(y\x,h) is usually (but 
not always) fulfilled. Hence, if strict positivity is checked for the final solu- 
tion, then it is not necessary to include extra non-negativity terms in the er- 
ror (see Section [3.2.1|) . For the sake of simplicity we will therefore not include 
non-negativity terms explicitly in the following. In case a non-negativity 
constraint has to be included this can be done using Lagrange multipliers, or 
alternatively, by writing the step functions in p(pos.\h) oc Yl xy Q(p(y\x, h)) 

/•OO /"OO 

G{x-a)= d£ drie^- x \ (99) 

J a J—oo 

and solving the ^-integral in saddle point approximation (See for example 

Including the normalization condition in the prior po(h\Do) in form of a 
5-functional results in a posterior probability 

P (h\f)= e Y,Myi\^h)-E{h\b )+c{H\b ) -Q g f f dye L(y\x,h) _ A (1Q0) 

with constant c(H\D ) = — In Z(h\D ) related to the normalization of the 
specific prior e - E ( h \ D o)_ Writing the (^-functional in its Fourier representation 




(101) 
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i.e., 



$([dyeW*M _ i) = JL r dAx ( x) e A JC (.)(i-/*^W-«) | (1Q2) 

J 2717 J-ioo 

and performing a saddle point approximation with respect to Ax{%) (which 
is exact in this case) yields 

P(h\f) = e E l i >(^l^'' l )-- B ( /l l^o)+ £ ( H l^o)+/ d:rA ^( :c )( 1 -/^ eL(!/|a: '' l) ). (103) 

This is equivalent to the Lagrange multiplier approach. Here the stationary 
Ax{x) is the Lagrange multiplier vector (or function) to be determined by 
the normalization conditions for p(y\x, h) = e L ^ x ' h \ Besides the Lagrange 
multiplier terms it is numerically sometimes useful to add additional terms 
to the log-posterior which vanish for normalized p(y\x, h). 



2.5 Empirical risk minimization 

In the previous sections the error functionals we will try to minimize in the 
following have been given a Bayesian interpretation in terms of the log- 
posterior density. There is, however, an alternative justification of error 
functionals using the Frequentist approach of empirical risk minimization 
2l9j ggg , p21j . 



Common to both approaches is the aim to minimize the expected risk for 
action a 

r(aj)= [dxdyp{x,y\f{D,D°))l{x,y,a). (104) 



The expected risk, however, cannot be calculated without knowledge of the 
true p(x,y\f). In contrast to the Bayesian approach of modeling p(x,y\f) 
the Frequentist approach approximates the expected risk by the empirical 
risk 

E(a)=r(a,f) = '£l(x i ,y i ,a), (105) 

i 

i.e., by replacing the unknown true probability by an observable empirical 
probability. Here it is essential for obtaining asymptotic convergence results 
to assume that training data are sampled according to the true p{x,y\f) 
|219| , |52| , |189| , |127| , [221|| . Notice that in contrast in a Bayesian approach the 



density p(x{) for training data D does according to Eq. (|16|) not enter the 
formalism because D enters as conditional variable. For a detailed discussion 
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of the relation between quadratic error functionals and Gaussian processes 
see for example g7|, [TSg, [181], 012], [ll3|, [150], |23| g. 



From that Frequentist point of view one is not restricted to logarithmic 
data terms as they arise from the posterior-related Bayesian interpretation. 
However, like in the Bayesian approach, training data terms are not enough to 
make the minimization problem well defined. Indeed this is a typical inverse 
problem ||219| , 115 , |221|| which can, according to the classical regularization 
approach [ |215| , |216| , 162 1, be treated by including additional regularization 
(stabilizer) terms in the loss function I. Those regularization terms, which 
correspond to the prior terms in a Bayesian approach, are thus from the 
point of view of empirical risk minimization a technical tool to make the 
minimization problem well defined. 

The empirical generalization error for a test or validation data set inde- 
pendent from the training data D, on the other hand, is measured using only 
the data terms of the error functional without regularization terms. In empir- 
ical risk minimization this empirical generalization error is used, for example, 
to determine adaptive (hyper-) parameters of regularization terms. A typi- 
cal example is a factor multiplying the regularization terms controlling the 
trade-off between data and regularization terms. Common techniques using 
the empirical generalization error to determine such parameters are cross- 
validation or bootstrap like techniques | |163| , |225 
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From a strict Bayesian point of view those parameters would have to be 
integrated out after defining an appropriate prior [To", 146, 148, 



2.6 Interpretations of Occam's razor 

The principle to prefer simple models over complex models and to find an 
optimal trade-off between fitting data and model complexity is often referred 
to as Occam's razor (William of Occam, 1285-1349). Regularization terms, 
penalizing for example non-smooth ("complex") functions, can be seen as 
an implementation of Occam's razor. 

The related phenomena appearing in practical learning is called over- 
fitting 
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)f| p4}| . Indeed, when studying the generalization behavior of 
trained models on a test set different from the training set, it is often found 
that there is an optimal model complexity. Complex models can due to 
their higher flexibility achieve better performance on the training data than 
simpler models. On a test set independent from the training set, however, 
they can perform poorer than simpler models. 
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Notice, however, that the Bayesian interpretation of regularization terms 
as (a priori) information about Nature and the Frequentist interpretation 
as additional cost terms in the loss function are not equivalent. Complexity 
priors reflects the case where Nature is known to be simple while complex- 
ity costs express the wish for simple models without the assumption of a 
simple Nature. Thus, while the practical procedure of minimizing an error 
functional with regularization terms appears to be identical for empirical risk 
minimization and a Bayesian Maximum A Posteriori Approximation, the un- 
derlying interpretation for this procedure is different. In particular, because 
the Theorem in Section |2.3| holds only for log-loss, the case of loss functions 
differing from log-loss requires from a Bayesian point of view to distinguish 
explicitly between model states h and actions a. Even in saddle point ap- 
proximation, this would result in a two step procedure, where in a first step 
the hypothesis h*, with maximal posterior probability is determined, while 
the second step minimizes the risk for action a e A under that hypothesis 



h* 131 



2.7 A priori information and a posteriori control 

Learning is based on data, which includes training data as well as a pri- 
ori data. It is prior knowledge which, besides specifying the space of local 
hypothesis, enables generalization by providing the necessary link between 
measured training data and not yet measured or non-training data. The 
strength of this connection may be quantified by the mutual information of 



training and non-training data, as we did in Section |2. 1.5 . 

Often, the role of a priori information seems to be underestimated. There 
are theorems, for example, proving that asymptotically learning results be- 
come independent of a priori information if the number of training data goes 
to infinity. This, however, is correct only if the space of hypotheses h is al- 
ready sufficiently restricted and if a priori information means knowledge in 
addition to that restriction. 

In particular, let us assume that the number of potential test situations 
x, is larger than the number of training data one is able to collect. As the 
number of actual training data has to be finite, this is always the case if 
x can take an infinite number of values, for example if a; is a continuous 
variable. The following arguments, however, are not restricted to situations 
were one considers an infinite number of test situation, we just assume that 
their number is too large to be completely included in the training data. 
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If there are x values for which no training data are available, then learn- 
ing for such x must refer to the mutual information of such test data and 
the available training data. Otherwise, training would be useless for these 
test situations. This also means, that the generalization to non-training 
situations can be arbitrarily modified by varying a priori information. 

To make this point very clear, consider the rather trivial situation of 
learning a deterministic function h(x) for a x variable which can take only 
two values x\ and X2, from which only one can be measured. Thus, hav- 
ing measured for example h(x\) = 5, then "learning" h(x2) is not possible 
without linking it to h(x\). Such prior knowledge may have the form of a 
"smoothness" constraint, say \h{x\)—h{x2)\ < 2 which would allow a learning 
algorithm to "generalize" from the training data and obtain 3 < h^x^) < 7. 
Obviously, arbitrary results can be obtained for h{%2) by changing the prior 
knowledge. This exemplifies that generalization can be considered as a mere 
reformulation of available information, i.e., of training data and prior knowl- 
edge. Except for such a rearrangement of knowledge, a learning algorithm 
does not add any new information to the problem. (For a discussion of the 
related "no-free-lunch" theorems see ||235| , [23(| . ) 

Being extremely simple, this example nevertheless shows a severe prob- 
lem. If the result of learning can be arbitrary modified by a priori informa- 
tion, then it is critical which prior knowledge is implemented in the learning 
algorithm. This means, that prior knowledge needs an empirical foundation, 
just like standard training data have to be measured empirically. Otherwise, 
the result of learning cannot expected to be of any use. 

Indeed, the problem of appropriate a priori information is just the old 
induction problem, i.e., the problem of learning general laws from a finite 
number of observations, as already been discussed by the ancient Greek 
philosophers. Clearly, this is not a purely academic problem, but is ex- 
tremely important for every system which depends on a successful control 
of its environment. Modern applications of learning algorithms, like speech 
recognition or image understanding, rely essentially on correct a priori in- 
formation. This holds especially for situations where only few training data 
are available, for example, because sampling is very costly. 

Empirical measurement of a priori information, however, seems to be 
impossible. The reason is that we must link every possible test situation to 
the training data. We are not able to do this in practice if, as we assumed, the 
number of potential test situations is larger than the number of measurements 
one is able to perform. 
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Take as example again a deterministic learning problem like the one dis- 
cussed above. Then measuring a priori information might for example be 
done by measuring (e.g., bounds on) all differences h(xi) — h(xi). Thus, 
even if we take the deterministic structure of the problem for granted, the 
number of such differences is equal to the number of potential non-training 
situations xi we included in our model. Thus, measuring a priori information 
does not require fewer measurements than measuring directly all potential 
non-training data. We are interested in situations where this is impossible. 

Going to a probabilistic setting the problem remains the same. For exam- 
ple, even if we assume Gaussian hypotheses with fixed variance, measuring 
a complete mean function h(x), say for continuous x, is clearly impossible 
in practice. The same holds thus for a Gaussian process prior on h. Even 
this very specific prior requires the determination of a covariance and a mean 
function (see Chapter ^J). 

As in general empirical measurement of a priori information seems to be 
impossible, one might thus just try to guess some prior. One may think, for 
example, of some "natural" priors. Indeed, the term "a priori" goes back 
to Kant ||111|| who assumed certain knowledge to be necessarily be given "a 
priori" without reference to empirical verification. This means that we are 
either only able to produce correct prior assumptions, for example because 
incorrect prior assumptions are "unthinkable", or that one must typically 
be lucky to implement the right a priori information. But looking at the 
huge number of different prior assumptions which are usually possible (or 
"thinkable"), there seems no reason why one should be lucky. The question 
thus remains, how can prior assumptions get empirically verified. 

Also, one can ask whether there are "natural" priors in practical learning 
tasks. In Gaussian regression one might maybe consider a "natural" prior 
to be a Gaussian process with constant mean function and smoothness- 
related covariance. This may leave a single regularization parameter to be 
determined for example by cross-validation. Formally, one can always even 
use a zero mean function for the prior process by subtracting a base line 
or reference function. Thus does, however, not solve the problem of finding 
a correct prior, as now that reference function has to be known to relate 
the results of learning to empirical measurements. In principle any function 
could be chosen as reference function. Such a reference function would for 
example enter a smoothness prior. Hence, there is no "natural" constant 
function and from an abstract point of view no prior is more "natural" than 
any other. 
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Formulating a general law refers implicitly (and sometimes explicitly) to 
a "ceteris paribus" condition, i.e., the constraint that all relevant variables, 
not explicitly mentioned in the law, are held constant. But again, verifying 
a "ceteris paribus" condition is part of an empirical measurement of a priori 
information and by no means trivial. 

Trying to be cautious and use only weak or "uninformative" priors does 
also not solve the principal problem. One may hope that such priors (which 
may be for example an improper constant prior for a one-dimensional real 
variable) do not introduce a completely wrong bias, so that the result of 
learning is essentially determined by the training data. But, besides the 
problem to define what exactly an uninformative prior has to be, such priors 
are in practice only useful if the set of possible hypothesis is already suffi- 
ciently restricted, so "the data can speak for themselves" f69|. Hence, the 
problem remains to find that priors which impose the necessary restrictions, 
so that uninformative priors can be used. 

Hence, as measuring a priori information seems impossible and finding 
correct a priori information by pure luck seems very unlikely, it looks like also 
successful learning is impossible. It is a simple fact, however, that learning 
can be successful. That means there must be a way to control a priori 
information empirically. 

Indeed, the problem of measuring a priori information may be artificial, 
arising from the introduction of a large number of potential test situations 
and correspondingly a large number of hidden variables h (representing what 
we call "Nature") which are not all observable. 

In practice, the number of actual test situations is also always finite, just 
like the number of training data has to be. This means, that not all poten- 
tial test data but only the actual test data must be linked to the training 
data. Thus, in practice it is only a finite number of relations which must 
be under control to allow successful generalization. (See also Vapnik's dis- 
tinction between induction and transduction problems. |[221|| : In induction 
problems one tries to infer a whole function, in transduction problems one is 
only interested in predictions for a few specific test situations.) 

This, however, opens a possibility to control a priori information em- 
pirically. Because we do not know which test situation will occur, such an 
empirical control cannot take place at the time of training. This means a 
priori information has to be implemented at the time of measuring the test 
data. In other words, a priori information has to be implemented by the 



measurement process 131, 134 



37 



0.2 



20 40 60 80 100 20 40 60 80 100 

Figure 4: The l.h.s. shows a bounded random function which does not allow 
generalization from training to non-training data. Using a measurement 
device with input averaging (r.h.s.) or input noise the function becomes 
learnable. 



Again, a simple example may clarify this point. Consider the prior in- 
formation, that a function h is bounded, i.e., a < h(x) < b, Vx. A direct 
measurement of this prior assumption is practically not possible, as it would 
require to check every value h(x). An implementation within the measure- 
ment process is however trivial. One just has to use a measurement device 
which is only able to to produce output in the range between a and b. This 
is a very realistic assumption and valid for all real measurement devices. 
Values smaller than a and larger than b have to be filtered out or actively 
projected into that range. In case we nevertheless find a value out of that 
range we either have to adjust the bounds or we exchange the "malfunction- 
ing" measurement device with a proper one. Note, that this range filter is 
only needed at the finite number of actual measurements. That means, a 
priori information can be implemented by a posteriori control at the time of 
testing. 

A realistic measurement device does not only produce bounded output 
but shows also always input noise or input averaging. A device with input 
noise has noise in the x variable. That means if one intends to measure at 
Xi the device measures instead at Xi + A with A being a random variable. A 
typical example is translational noise, with A being a, possibly multidimen- 
sional, Gaussian random variable with mean zero. Similarly, a device with 
input averaging returns a weighted average of results for different x values 
instead of a sharp result. Bounded devices with translational input noise, for 
example, will always measure smooth functions ||128| , |131| . (See Fig. f|.) 



This may be an explanation for the success of smoothness priors. 
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The last example shows, that to obtain adequate a priori information 
it can be helpful in practice to analyze the measurement process for which 
learning is intended. The term "measurement process" does here not only 
refer to a specific device, e.g., a box on the table, but to the collection of all 
processes which lead to a measurement result. 

We may remark that measuring a measurement process is as difficult or 
impossible as a direct measurement of a priori information. What has to 
be ensured is the validity of the necessary restrictions during a finite num- 
ber of actual measurements. This is nothing else than the implementation 
of a probabilistic rule producing y given the test situation and the training 
data. In other words, what has to be implemented is the predictive density 
p(y\x, D). This predictive density indeed only depends on the actual test 
situation and the finite number of training data. (Still, the probability den- 
sity for a real y cannot strictly be empirically verified or controlled. We may 
take it here, for example, as an approximate statement about frequencies.) 
This shows the tautological character of learning, where measuring a priori 
information means controlling directly the corresponding predictive density. 

The a posteriori interpretation of a priori information can be related to 
a constructivistic point of view. The main idea of constructivism can be 
characterized by a sentence of Vico (1710): Verum ipsum factum — the 
truth is the same as the made |222|| . (For an introduction to constructivism 
see 



2271 and references therein, for constructive mathematics see |Ea 



3 Gaussian prior factors 

3.1 Gaussian prior factor for log— probabilities 

3.1.1 Lagrange multipliers: Error functional El 

In this chapter we look at density estimation problems with Gaussian prior 
factors. We begin with a discussion of functional priors which are Gaussian in 
probabilities or in log-probabilities, and continue with general Gaussian prior 
factors. Two section are devoted to the discussion of covariances and means 
of Gaussian prior factors, as their adequate choice is essential for practical 
applications. After exploring some relations of Bayesian field theory and 
empirical risk minimization, the last three sections introduce the specific 
likelihood models of regression, classification, inverse quantum theory. 
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We begin a discussion of Gaussian prior factors in L. As Gaussian prior 
factors correspond to quadratic error (or energy) terms, consider an error 
functional with a quadratic regularizer in L 

(L,KL) = \\L\\l c = ^J dxdydx'dy'L(x,y)K(x,y;x',y')L(x',y'), (106) 

writing for the sake of simplicity from now on L(x, y) for the log-probability 
L(y\x, h) = \np(y\x, h). The operator K is assumed symmetric and positive 
semi-definite and positive definite on some subspace. (We will understand 
positive semi-definite to include symmetry in the following.) For positive 
(semi) definite K the scalar product defines a (semi) norm by 



K 



(L,KL), (107) 



and a corresponding distance by \\L — L'\\k. The quadratic error term (|106|) 
corresponds to a Gaussian factor of the prior density which have been called 
the specific prior p(h\D ) = p(L\D ) for L. In particular, we will consider 
here the posterior density 

j)(h\f) = e^^^'H Jdxdydx'dy'L(x,y)K(x,y;x',y')L(x',y')+JdxA x (x)(l-Jdye L ^-y^)+c, 

(108) 

where prefactors like j3 are understood to be included in K. The constant 
c referring to the specific prior is determined by the determinant of K ac- 
cording to Eq. (|70|). Notice however that not only the likelihood J2i Li but 
also the complete prior is usually not Gaussian due to the presence of the 
normalization conditions. (An exception is Gaussian regression, see Section 
|3.7| .) The posterior ( |108| ) corresponds to an error functional 

E L = (3E comh = — (L, N) + X -{L, KL) + (e L - 8(y), A x ), (109) 

with likelihood vector (or function) 

L(x,y) = L(y\x,h), (110) 

data vector (function) 

n 

N(x,y) = Y,5(x-x z )5(y-y t ), (111) 



40 



Lagrange multiplier vector (function) 

Ax(x,y)=A x (x), (112) 

probability vector (function) 

e L (x,y) = e L ^ = P(x,y) = p(y\x,h), (113) 

and 

5(y)(x,y) = 5(y). (114) 

According to Eq. (jlll|) N/ n = P emp is an empirical density function for the 
joint probability p(x,y\h). 

We end this subsection by defining some notations. Functions of vectors 
(functions) and matrices (operators), different from multiplication, will be 
understood element-wise like for example (e L )(x,y) = e L( - x ' y \ Only mul- 
tiplication of matrices (operators) will be interpreted as matrix product. 
Element-wise multiplication has then to be written with the help of diag- 
onal matrices. For that purpose we introduce diagonal matrices made from 
vectors (functions) and denoted by the corresponding bold letters. For in- 
stance, 

(115) 
(116) 
(117) 
(118) 
(119) 
(120) 



or 



I(x,y;x',y') = 


5(x — x')5(y — 


y'), 


L(x,y;x',y') = 


8{x — x')8(y — 


y')L(x,y), 


P(x,y;x',y') = 


e L (x,y;x',?/ / ) 






8(x — x')5(y — 


y')P(x,y), 


N(x,y;x',y') = 


S(x — x')8(y — 


y')N(x,y), 


A x (x,y;x' ' ,y') = 


5(x — x')5(y — 


y')A x (x), 


p = p/, e L = 


e L J, N = NI 


, A x = A X I 


I 


(x,y) = 1. 





L = LI, P = PI, e L = e h I, N = NI, A x = A X I, (121) 

where 

(122) 

Being diagonal all these matrices commute with each other. Element-wise 
multiplication can now be expressed as 

(KL)(x', y',x,y) = j dx" dy"K(x' , y', x", y")L(x", y", x, y) 

= J dx" dy"~K(x' , y ', x" , y")L(x, y)5(x — x")5(y — y") 
= K(x',y',x,y)L(x,y). (123) 
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In general this is not equal to L(x',y')K.(x',y',x,y). In contrast, the matrix 
product KL with vector L 

(KL)(x',y') = J dxdyK(x' ,y', x,y)L(x,y), 

does not depend on x, y anymore, while the tensor product or outer product, 
(K ® L)(x", y", x, y, x', y') = K(x",y",x',y')L(x,y), (125) 
depends on additional x", y". 



Taking the variational derivative of ( 108 ) with respect to L(x,y) using 

6 -§f^ = 5(x - x')5(y - y') (126) 

and setting the gradient equal to zero yields the stationarity equation 

= iV — KL — e L A x . (127) 

Alternatively, we can write e L Ax = Axe L = PAx- 

The Lagrange multiplier function Ax is determined by the normalization 
condition 

Z x (x) = I dye L(x ' y) = 1, Vz 6 X, (128) 
which can also be written 

Z x = \ X P = I x e L = I or Z X = I, (129) 

in terms of normalization vector, 

Z x {x,y) = Z x {x), (130) 

normalization matrix, 

Z x (x, y; x, y) = 5(x - x')5(y - y) Z x (x), (131) 
and identity on X, 

I x (x,y;x',y') = 8(x - x'). (132) 

Multiplication of a vector with I x corresponds to y-integration. Being a 
non-diagonal matrix Ix does in general not commute with diagonal matrices 
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like L or P. Note also that despite Ix& L — Ix eL ^ = 1/ = / in general IxP 
= Ix eL 7^ I — Zx- According to the fact that Ix and Ax commute, i.e., 



I X A X = A X I X & [A x , Ix] = A X I X - 1 X A X = 0, (133) 

(introducing the commutator [A, B] = AB — BA), and that the same holds 
for the diagonal matrices 

[A x ,e L ] = [A x ,P] = 0, (134) 
it follows from the normalization condition I X P = I that 

IxPAx = IxA x P = A X I X P = A x / = A X) (135) 

i.e., 

= (I - I x e L )A x = (I - IxP)Ax- (136) 

For A x (x) 7^ Eqs. (|135|J136|) are equivalent to the normalization (|128|) . If 
there exist directions at the stationary point L* in which the normalization of 
P changes, i.e., the normalization constraint is active, a Ax{x) ^ restricts 
the gradient to the normalized subspace (Kuhn-Tucker conditions [0, |19|, |99| , 
188| ). This will clearly be the case for the unrestricted variations of p(y,x) 



which we are considering here. Combining Ax = IxPAx for Ax(x) ^ with 
the stationarity equation ( |127|) the Lagrange multiplier function is obtained 

Ax = Ix (N - KL) = N x - (IxKL). (137) 

Here we introduced the vector 

N x = I X N, (138) 

with components 

N x (x,y) = N x (x) = J2^(x - x t ) = n x , (139) 

i 

giving the number of data available for x. Thus, Eq. (|137 ) reads in compo- 
nents 

A x (x) = J2S(x-Xi) - Jdy" dx'dy'K(x,y";x',y')L(x',y'). (140) 
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Inserting now this equation for Ax into the stationarity equation (|127|) yields 

= N -KL- e^iNx - I X KL) = (i - e L I x ) (JV - KL) . (141) 

Eq. ( |141| ) possesses, besides normalized solutions we are looking for, also 
possibly unnormalized solutions fulfilling N = KL for which Eq. (|137| ) yields 
Ax = 0. That happens because we used Eq. ( |135| ) which is also fulfilled 
for Ax (a;) = 0. Such a A x (x) = does not play the role of a Lagrange 
multiplier. For parameterizations of L where the normalization constraint is 
not necessarily active at a stationary point Ax{x) = can be possible for a 
normalized solution L*. In that case normalization has to be checked. 
It is instructive to define 

T L = N-A x e L , (142) 
) acquires the form 

KL = T L , (143) 
which reads in components 

J dx'dy' K{x, y; x', y')L(x', y') = ]T S(x - Xl )6(y - Vi ) - A x (x) e L ^ , (144) 

i 

which is in general a non-linear equation because depends on L. For 
existing (and not too ill-conditioned) K _1 the form ( |143|) suggest however 
an iterative solution of the stationarity equation according to 

L i+i = k -i Tl ( L *) ) ( 145 ) 

for discretized L, starting from an initial guess L°. Here the Lagrange multi- 
plier Ax has to be adapted so it fulfills condition ( |137| ) at the end of iteration. 
Iteration procedures will be discussed in detail in Section |7|. 



so the stationarity equation ( 



3.1.2 Normalization by parameterization: Error functional E g 

Referring to the discussion in Section |2.3| we show that Eq. (|141|) can alter- 
natively be obtained by ensuring normalization, instead of using Lagrange 
multipliers, explicitly by the parameterization 



L(x,y)=g(x,y)-\n J " dy' ' e^'\ L = g-\nZ x , 



(146) 
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and considering the functional 

E g = -(N, 9 -lnZ x ) + ~(g-lnZ x ,K(g-lnZ x )). (147) 

The stationary equation for g(x, y) obtained by setting the functional deriva- 
tive SE g /5g to zero yields again Eq. fll4lD . We check this, using 

= t lx -W<>* S J^ = I X ^ = UH X ) T , (148) 
bg{x,y) dg v ' 

and 

= S(x - x')6(y - y>) - 6(x - x')e L ^\ ^ = I - I x e\ (149) 
bg{x,y) dg 

where denotes a matrix, and the superscript T the transpose of a matrix. 
We also note that despite I x = I x 

I x e L ± e L I x = (Ixe L ) T , (150) 

is not symmetric because e L depends on y and does not commute with the 
non-diagonal Ix- Hence, we obtain the stationarity equation of functional 



E g written in terms of L(g) again Eq. (|141 



= " (^) T Jl = Gl " eL Ax = ( J " eLlx ) {N ~ KL) ■ (151) 

Here Gl = N — KL = —5E g /SL is the L-gradient of —E g . Referring to the 
discussion following Eq. ( |141| ) we note, however, that solving for g instead 
for L no unnormalized solutions fulfilling iV = KL are possible. 

In case In Z x is in the zero space of K the functional E g corresponds to 
a Gaussian prior in g alone. Alternatively, we may also directly consider a 
Gaussian prior in g 

E 9 = -(N,g-lnZ x )+^(g,Kg), (152) 

with stationarity equation 

= N - Kg - e L iV x . (153) 
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Notice, that expressing the density estimation problem in terms of g, nonlo- 
cal normalization terms have not disappeared but are part of the likelihood 
term. As it is typical for density estimation problems, the solution g can be 
calculated in X-data space, i.e., in the space defined by the Xi of the training 
data. This still allows to use a Gaussian prior structure with respect to the 



x-dependency which is especially useful for classification problems [231 



3.1.3 The Hessians H L , H ( 



The Hessian of —El is defined as the matrix or operator of second deriva- 
tives 

tt it\( i r- 5 2 (-E r 



5L(x,y)6L(x',y') 



(154) 



For functional ( |109| ) and fixed Ax we find the Hessian by taking the derivative 
of the gradient in (|127|) with respect to L again. This gives 

H. L (L)(x lV ;x'y') = -K(x, y; x'y') - 5(x - x')5{y - y')A x (x)e L ^ (155) 



or 



H L = -K - A x e L . 



(156) 



The addition of the diagonal matrix Aje L = e L Aj( can result in a negative 
definite H even if K has zero modes, like in the case where K is a differential 
operator with periodic boundary conditions. Note, however, that A^e 1 is 
diagonal and therefore symmetric, but not necessarily positive definite, be- 
cause Ax(x) can be negative for some x. Depending on the sign of A x (x) 
the normalization condition Zx{x) = 1 for that x can be replaced by the 
inequality Z x [x) < 1 or Z x [x) > 1. Including the L-dependence of Ax and 
with 



Sg(x,y) 



S(x - x')5{y - y')e L{x ' y) - 8(x - x')e L{x ' y) e L{x '' y '\ (157) 



i.e., 



~5g 



(l-e L Ix)e L = e L -e L Ixe L , 



(158) 



we find, written in terms of L, 

H g (L)(x,y;x',y') 



5 2 (-E n 



Sg(x,y)Sg(x',y') 
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,,,,( n~E 9 ) 6L(x»,y») | S(-E a ) 5>L(x»,y» 



5L(x,y)5L(x»,y») 5g(x',y') 5L(x",y") 5g(x,y)5g(x',y') 

-K(x, y; y') - e ^ J dy"dy"'K(x', y"; x, y'") 

+e L(x '' y,) J dy"K(x', y"; x, y) + e L(x ' y) J dy"K(x' , y'; x, y") 
-5(x-x')5(y-y')e L ^ (n x (x) - Jdy"(KL)(x,y") 



+6(x - x')e L ^e L ^'^ (n x (x) - J dy"(KL)(x, y")) . (159) 

The last term, diagonal in X, has dyadic structure in Y, and therefore for 
fixed x at most one non-zero eigenvalue. In matrix notation the Hessian 
becomes 

H s = - (i - e L I x ) K (i - I x e L ) - (i - e L I x ) A x e L 

- (I - PI X ) [K (I - I X P) + A X P] , (160) 

the second line written in terms of the probability matrix. The expression 
is symmetric under x <-> x',y <-> y', as it must be for a Hessian and as can 
be verified using the symmetry of K = K T and the fact that Ax and lx 
commute, i.e., [Ax,Ix] = 0. Because functional E g is invariant under a 
shift transformation, g(x,y) — > g'(x,y) + c(x), the Hessian has a space of 
zero modes with the dimension of X. Indeed, any y-independent function 
(which can have finite L 1 -norm only in finite V-spaces) is a left eigenvector 
of (l — e L Ix) with eigenvalue zero. The zero mode can be removed by pro- 
jecting out the zero modes and using where necessary instead of the inverse 
a pseudo inverse of H, for example obtained by singular value decomposi- 
tion, or by including additional conditions on g like for example boundary 
conditions. 



3.2 Gaussian prior factor for probabilities 

3.2.1 Lagrange multipliers: Error functional E P 

We write P(x,y) = p(y\x, h) for the probability of y conditioned on x and 
h. We consider now a regularizing term which is quadratic in P instead of 
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L. This corresponds to a factor within the posterior probability (the specific 
prior) which is Gaussian with respect to P. 

p(h\f) =e Y,i lnP i( x i>yi)-hf dxd y dx ' d v' ' P(x,y)K(x,y;x' ,y')P(x' ,y')+fdx A x (x)(l- f dy P(x,y))+c, 

(161) 

or written in terms of L = In P for comparison, 

(162) 

Hence, the error functional is 

E P = (3E corah = -(In P, N) + i(P, K P) + ( P - 5{y) , A*). (163) 
In particular, the choice K = |l, i.e., 

\{P,P) = \\\P\W (164) 

can be interpreted as a smoothness prior with respect to the distribution 
function of P (see Section [3.3|) . 

In functional ( |163| ) we have only implemented the normalization condition 
for P by a Lagrange multiplier and not the non-negativity constraint. This 
is sufficient if P(x, y) > (i.e., P(x, y) not equal zero) at the stationary point 
because then P(x, y) > holds also in some neighborhood and there are no 
components of the gradient pointing into regions with negative probabilities. 
In that case the non-negativity constraint is not active at the stationarity 
point. A typical smoothness constraint, for example, together with positive 
probability at data points result in positive probabilities everywhere where 
not set to zero explicitly by boundary conditions. If, however, the station- 
ary point has locations with P(x, y) — at non-boundary points, then the 
component of the gradient pointing in the region with negative probabili- 
ties has to be projected out by introducing Lagrange parameters for each 
P(x, y). This may happen, for example, if the regularizer rewards oscillatory 
behavior. 

The stationarity equation for Ep is 

= P^AT - KP - A x , (165) 

with the diagonal matrix P(x', y'\ x, y) = 6(x — x')5(y — y')P(x, y), or multi- 
plied by P 

= N- PKP - PA X . (166) 
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Probabilities P(x,y) are unequal zero at observed data points (xi,yi) so 
P _1 iV is well defined. 

Combining the normalization condition Eq. ( |135| ) for Ax(x) 7^ with Eq. 



( 165 ) or ( |166| ) the Lagrange multiplier function Ax is found as 

Ax = Ix(N- PKP) = N X - IxPKP, (167) 

where 

l x PKP(x,y) = Jdy'dx"dy"P{x,y')K(x,y';x",y")P{x",y"). 



Eliminating Ax in Eq. (|165|) by using Eq. (|167 ) gives finally 



= (I-IxP)(P _1 A r -KP), (168) 

or for Eq. (|166| ) 

= (I-PI X )(JV-PKP). (169) 

For similar reasons as has been discussed for Eq. ( 141 ) unnormalized solutions 
fulfilling N — PKP are possible. Defining 

T P = P^N - Ax = P _1 iV - N x - IxPKP, (170) 

the stationarity equation can be written analogously to Eq. ( |143 ) as 



KP = T P , (171) 
with T P = Tp(P), suggesting for existing K -1 an iteration 

P l+1 = K- 1 T P (P i ), (172) 
starting from some initial guess P°. 

3.2.2 Normalization by parameterization: Error functional E z 

Again, normalization can also be ensured by parameterization of P and solv- 
ing for unnormalized probabilities z, i.e., 

P(*,V)= rf X ) V) v P = ^T- (173) 
Jdyz{x,y) Z x 
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The corresponding functional reads 
E. = 



N \ Z \ + ^ ( Z K Z 

Z x ) 2 \Z X ' Z x 



We have 

5z ' 



5Z 



x 



5z 



I 



x- 



Slnz 

5z 



ZxP) 



5\nZ x 

5z 



with diagonal matrix z built analogous to P and Z x , and 
5P S(z/Z x ) „ x <51nP 



(174) 



(175) 



5z 



5z 

5Z x l 



z-/ (I - PI*) , 
-&x A x, 
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z^ 1 (p- 1 - I* 



P^Z^I-PI*) 



(176) 



(177) 



Sz 52 
The diagonal matrices [Zx,P] = commute, as well as [Z*,Ix] = 0, but 
[P, l x ) 7^ 0. Setting the gradient to zero and using 



(I -PI 



x, 



we find 



= - 



I-IxP 



'SPY 5E„ 



(178) 



x Sz J 5P 

;p- 1 -i x )iv-(i-i x p)KP 

= Z x l (I - I X P) (P-'N - KP) 
= Z~ x x (I - \ X P) Gp = Z x x (G P - A x ) = (G P - A x ) Z~ x \ (179) 

with P-gradient G P = P^N - KP = -5EJ5P of -E z and G P the cor- 
responding diagonal matrix. Multiplied by Z x this gives the stationarity 
equation (171). 



3.2.3 The Hessians H P , H z 

We now calculate the Hessian of the functional —Ep. For fixed Ax one finds 



the Hessian by differentiating again the gradient ( |165|) of — Ep 

S(x - Xi)6(y - Hi 



H P (P) (x, y; x'y') = -K(x'y'; x, y) - 8{x - x')8{y - y') £ 



P 2 (x,y) 



180) 
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i.e., 



K-P 2 N. 



181) 



Here the diagonal matrix P~ 2 N is non-zero only at data points. 

Including the dependence of Ax on P one obtains for the Hessian of — E z 
174J) by calculating the derivative of the gradient in ( |179|) 



m 



H z (x,y;x',y/) 



Z x (x) 



K(x,y;x',y/) 



- J dy"(p{x, y")K(x, y"; x', y') + K(x, y- x', y")p(x', y"] 
+ J dy"dy'"p(x, y")K(x, y"\ x', y"')p(x', y") 

+s{x _ x >) S ( y - y >) y: s ( x - x f(y-y*) _ S (x - x') e 5( x - Xt ) 

-S(x - x') J dx"dy"(K(x, y; x", y")p(x", y") + p(x", y")K(x" , y"; x', y' 



+ 2 5(x — x') / dy"dx"'dy"'p(x, y")~K(x, y"; x'", y m )p(x m ', y" 



Z x (x') 



;i82) 



i.e., 



H Z v 1 (I - IxP) (-K - P- 2 N) (I - PI X ) Z x x 

\ x l (l x (G P - A x ) + (G P - Ax) lx) Zx 1 , 
-Z x l [ (I - I X P) K (I - PIx) + P~ 2 N 
-IxP^'N - NP- x Ix + IxNIx 
+IxGp + Gplx — 2IxAx Z x ■ 



183) 



184) 



Here we used [Ax,Ix] = 0. It follows from the normalization Jdyp(x,y) = 
1 that any y-independent function is right eigenvector of (I — IxP) with 
zero eigenvalue. Because Ax = IxPCp this factor or its transpose is also 
contained in the second line of Eq. ( |183| ), which means that H 2 has a zero 
mode. Indeed, functional E z is invariant under multiplication of z with a 
y-independent factor. The zero modes can be projected out or removed by 
including additional conditions, e.g. by fixing one value of z for every x. 
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3.3 General Gaussian prior factors 
3.3.1 The general case 

In the previous sections we studied priors consisting of a factor (the specific 
prior) which was Gaussian with respect to P or L = InP and additional 
normalization (and non-negativity) conditions. In this section we consider 
the situation where the probability p(y\x, h) is expressed in terms of a func- 
tion (p(x,y). That means, we assume a, possibly non-linear, operator P = 
P(4>) which maps the function to a probability. We can then formulate 
a learning problem in terms of the function 0, meaning that now repre- 
sents the hidden variables or unknown state of Nature Consider the case 
of a specific prior which is Gaussian in 0, i.e., which has a log-probability 
quadratic in 

-I(0,K0). (185) 
This means we are lead to error functionals of the form 

= -( In P(0) ,iV) + i(0,K0) + ( P(0) , A x ), (186) 

where we have skipped the 0-independent part of the Ax-terms. In general 
cases also the non-negativity constraint has to be implemented. 

To express the functional derivative of functional ( |186j ) with respect to 
we define besides the diagonal matrix P = P(0) the Jacobian, i.e., the 
matrix of derivatives P' = P'(0) with matrix elements 

P [x,y;x ,y ; 0) = — — — . (187) 

d(f>(x,y) 

The matrix P' is diagonal for point-wise transformations, i.e., for P(x, y; 0) = 
P( 0(x, y) ). In such cases we use P' to denote the vector of diagonal elements 
of P'. An example is the previously discussed transformation L = InP for 
which P' = P. The stationarity equation for functional (|186|) becomes 

= P , (0)P" 1 (0)A^-K0-P'(0)A x , (188) 

and for existing PP' 1 =(P'P _1 ) _1 (for nonexisting inverse see Section |4.1|) , 

= N - PP' - PA X . (189) 

2 Besides <fi also the hyperparameters discussed in Chapter || belong to the hidden 
variables h. 
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P(cf>) 


constraints 


P(x,y) 


P = P 


norm 


non-negativity 


z(x,y) 


P = zJ Jzdy 




non-negativity 


L(x, y) = InP 


P = e L 


norm 




g(%,y) 


P = e 9/fe 9 dy 






$ = jv dy'P 


P = d$/dy 


boundary 


monotony 



Table 2: Constraints for specific choices of 

From Eq. ( |189|) the Lagrange multiplier function can be found by integration, 
using the normalization condition IxP = I, in the form IxPAx = Ax for 
Ax 0*0 7^ 0. Thus, multiplying Eq. (|189|) by l x yields 

A x =I X (N- PP /_1 K 0) = N x - IxPP'^K 0. (190) 

Ax is now eliminated by inserting Eq. ( |190| ) into Eq. ( |189| ) 

= (I - PIx) (N - PP' X K 0) . (191) 

A simple iteration procedure, provided K _1 exists, is suggested by writing 
Eq. ( |188| ) in the form 

K0 = T , l+1 = K-%(^), (192) 

with 

T^(0) =P'p- 1 iV-P / Ax. (193) 
Table ^ lists constraints to be implemented explicitly for some choices of 

0. 

3.3.2 Example: Square root of P 

We already discussed the cases = InP with P' = P = e L , P/P' = 1 and 
(j) = P with P' = 1, P/P' = P. The choice = y/P yields the common 
L 2 -normalization condition over y 

1= Jdy4>\x,y), VxGX, (194) 
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which is quadratic in 0, and P = 2 , P' = 20, P/P' = 0/2. For real the 
non-negativity condition P > is automatically satisfied [0, |206| . 

For = \f~P and a negative Laplacian inverse covariance K = —A, one 
can relate the corresponding Gaussian prior to the Fisher information |38| , 
[206 , p02j1 . Consider, for example, a problem with fixed a; (so x can be skipped 
from the notation and one can write P{y)) and a rf^-dimensional y. Then 
one has, assuming the necessary differentiability conditions and vanishing 
boundary terms, 



K 



A. 



Cly 



dy k 



where (0) is the Fisher information, defined as 

2 



(195) 



(196) 



J f (Vo) = J d y 



dP(y-y°) 

dyO 



P(y-y°) 



dy 



d\nP(y — y 



dyl 



P(y-y° k ), (197) 



for the family P(- — y°) with location parameter vector y°. 

A connection to quantum mechanics can be found considering the training 
data free case 



Ed 



0, K0) + (A x , 0), 



has the homogeneous stationarity equation 



K. 



-2$A 



x- 



(198) 



(199) 



For x-independent Ax this is an eigenvalue equation. Examples include 
the quantum mechanical Schrodinger equation where K corresponds to the 
system Hamiltonian and 



2A 



K. 



x 



(200) 



to its ground state energy. In quantum mechanics Eq. (|200|) is the basis 
for variational methods (see Section |j) to obtain approximate solutions for 
ground state energies @, |193| , |27|| . 
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Similarly, one can take = J—(L — L max ) for L bounded from above by 
L max with the normalization 



1 = / dye" 4, (^)+ L — , Vx G X, (201) 
and P = e -^ 2+Lm - , P' = -2</>P, P/P' = -1/(20). 
3.3.3 Example: Distribution functions 

Instead in terms of the probability density function, one can formulate the 
prior in terms of its integral, the distribution function. The density P is then 
recovered from the distribution function by differentiation, 

= n = n v y j = ® v. = r- v, (202) 

resulting in a non-diagonal P'. The inverse of the derivative operator R 1 
is the integration operator R = ®^ R&P with matrix elements 

R(x, y; x, y') = 5(x - x')8(y - y'), (203) 

i.e., 

R k (x, y- x', y') = 5(x - x') [J 5( Vl - y[)e{y k - y' k ). (204) 

Thus, (|202|) corresponds to the transformation of (x-conditioned) density 
functions P in (x-conditioned) distribution functions = HP, i.e., <p(x,y) = 
fj^ P(x, y')dy' . Because R T KR is (semi)-positive definite if K is, a specific 
prior which is Gaussian in the distribution function is also Gaussian in the 
density P. P' becomes 

y; x', y') = V J" \ ' = S(x - x') ft S'(y k - y' k ). (205) 
<W, y) Y 

Here the derivative of the 5-function is defined by formal partial integration 
dy , f(y , )5'(y-y')=f(y')8(y'-y)\™ 00 -f'(y). (206) 



Fixing 0(x, — oo) = the variational derivative 5/(5<fi(x, — oo)) is not needed. 
The normalization condition for P becomes for the distribution function 
= RP the boundary condition 0(x, oo) = 1, Vx G X. The non-negativity 
condition for P corresponds to the monotonicity condition 0(x, y) > 0(x, y'), 
Vy > y', Vx G X and to 0(x, — oo) > 0, Vx G X. 
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3.4 Covariances and invariances 



3.4.1 Approximate invariance 

Prior terms can often be related to the assumption of approximate invariances 
or approximate symmetries. A Laplacian smoothness functional, for exam- 
ple, measures the deviation from translational symmetry under infinitesimal 
translations. 

Consider for example a linear mapping 

-> = S0, (207) 

given by the operator S. To compare with we define a (semi-) distance 
defined by choosing a positive (semi-) definite Kg, and use as error measure 

~((0-S0),K 5 (0-S0)) = ^(0,K0). (208) 

Here 

K = (I - S) T K 5 (I - S) (209) 

is positive semi-definite if Ks is. Conversely, every positive semi-definite K 
can be written K = W T W and is thus of form (|209|) with S = I — W and 



Kg = I. Including terms of the form of (|209|) in the error functional forces 
to be similar to 0. 

A special case are mappings leaving the norm invariant 

(0, 0) = (S0,S0) = (0, S T S0). (210) 

For real and i.e., (S0) = (S0)*, this requires S T = S _1 and S* = S. 
Thus, in that case S has to be an orthogonal matrix G O(N) and can be 
written 

OO 1 / \ & 

8(0) = e A = e ^ = g, - etAj , (211) 

with antisymmetric A = — A T and real parameters Q{. Selecting a set of 
(generators) Aj the matrices obtained be varying the parameters 6{ form a 
Lie group. Up to first order the expansion of the exponential function reads 
S ~ 1 + Si^Aj. Thus, we can define an error measure with respect to an 
infinitesimal transformation by 

-(1 + ^)0 0-(l + 0iA i )0\ 1/- V-aTiv- a ^ 

'e i — ' s — Oi — J = 2 (0 ' z2 A I K s^i<P)- 

(212) 




56 



3.4.2 Approximate symmetries 

Next we come to the special case of symmetries, i.e., invariance under un- 
der coordinate transformations. Symmetry transformations S change the 
arguments of a function 0. For example for the translation of a function 
4>(x) — > 4>(x) = S<f>(x) = <f>(x — c). Therefore it is useful to see how S acts 
on the arguments of a function. Denoting the (possibly improper) eigen- 
vectors of the coordinate operator x with eigenvalue x by (•, x) = \x), i.e., 
x|x) = x\x), function values can be expressed as scalar products, e.g. 0(x) 
= (x, 0) for a function in x, or, in two variables, <f>(x, y) — (x <S> y, 0). (Note 
that in this 'eigenvalue' notation, frequently used by physicists, for example 
2\x) 7^ |2x).) Thus, we see that the action of S on some function h(x) is 
equivalent to the action of S T ( = S _1 if orthogonal) on \x) 

S(f)(x) = (x, S0) = {S T x, 0), (213) 

or for (f)(x, y) 

S ( j>(x,y) = (S T (x®y), 0) . (214) 
Assuming S = S^S^ we may also split the action of S, 

S0(z,y) = ((S^)(g)y, S,0). (215) 

This is convenient for example for vector fields in physics where x and 0(-, y) 
form three dimensional vectors with y representing a linear combination of 
component labels of 0. 

Notice that, for a general operator S, the transformed argument S\x) 
does not have to be an eigenvector of the coordinate operator x again. In 
the general case S can map a specific \x) to arbitrary vectors being linear 
combinations of all \x'), i.e., S\x) = Jdx' S(x, x')\x'). A general orthogonal S 
maps an orthonormal basis to another orthonormal basis. Coordinate trans- 
formations, however, are represented by operators S, which map coordinate 
eigenvectors \x) to other coordinate eigenvectors \a{x)). Hence, such coor- 
dinate transformations S just changes the argument x of a function into 
<t(x), i.e., 

S0(x) = 0H*)), (216) 

with a(x) a permutation or a one-to-one coordinate transformation. Thus, 
even for an arbitrary nonlinear coordinate transformation a the correspond- 
ing operator S in the space of is linear. (This is one of the reasons why 
linear functional analysis is so useful.) 
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A special case are linear coordinate transformations for which we can 
write 4>(x) — * <fi(x) = S(f)(x) = <fi(Sx), with S (in contrast to S) acting in the 
space of x. An example of such S are coordinate rotations which preserve 
the norm in rc-space, analogously to Eq. (|210| ) for 0, and form a Lie group 
S(9) = e^i diAi acting on coordinates, analogously to Eq. ( |211| ). 

3.4.3 Example: Infinitesimal translations 

A Laplacian smoothness prior, for example, can be related to an approxi- 
mate symmetry under infinitesimal translations. Consider the group of d- 
dimensional translations which is generated by the gradient operator V. This 
can be verified by recalling the multidimensional Taylor formula for expan- 
sion of <p at x 

_ oo /v-i n rr \k 

S(eU(x) = e £^ v *0(x) = y J 4>{x) = 4>{x + 0). (217) 

Up to first order S ~ 1 + J2i Hence, for infinitesimal translations, the 
error measure of Eq. ( |212| ) becomes 

i \ 1 1 / i 

(218) 

assuming vanishing boundary terms and choosing Kg = I. This is the clas- 
sical Laplacian smoothness term. 

3.4.4 Example: Approximate periodicity 

As another example, lets us discuss the implementation of approximate pe- 
riodicity. To measure the deviation from exact periodicity let us define the 
difference operators 

V$(f>(x) = <f>(x)-(f>{x + 6), (219) 
V^0(x) = <f>(x-9)-<f>(x). (220) 

For periodic boundary conditions (Vg ) T = — Vf, where (Vg ) T denotes the 
transpose of Vg . Hence, the operator, 

A, = v£V? = -(V^) T V, R , (221) 
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defined similarly to the Laplacian, is positive definite, and a possible error 
term, enforcing approximate periodicity with period 9, is 

\(V R (9)<f>, V*(0)0) = -^(0, A e 0) = \jdx \<f>[x) - <j>{x + 9)\ 2 . (222) 

As every periodic function with 4>(x) = <p{x + 9) is in the null space of Ag 
typically another error term has to be added to get a unique solution of the 
stationarity equation. Choosing, for example, a Laplacian smoothness term, 
yields 

--(«/>, (A + \A e ) cp). (223) 

In case 9 is not known, it can be treated as hyperparameter as discussed in 
Section [5]. 

Alternatively to an implementation by choosing a semi-positive definite 
operator K with symmetric functions in its null space, approximate symme- 
tries can be implemented by giving explicitly a symmetric reference function 
t(x). For example, |(0 — t, K(0 — t) ) with t(x) = t(x + 9). This possibility 
will be discussed in the next section. 



3.5 Non— zero means 

A prior energy term (1/2) (0, K0) measures the squared K-distance of to 
the zero function t = 0. Choosing a zero mean function for the prior process 
is calculationally convenient for Gaussian priors, but by no means mandatory. 
In particular, a function cf) is in practice often measured relative to some non- 
trivial base line. Without further a priori information that base line can in 
principle be an arbitrary function. Choosing a zero mean function that base 
line does not enter the formulae and remains hidden in the realization of the 
measurement process. On the the other hand, including explicitly a non- 
zero mean function t, playing the role of a function template (or reference, 
target, prototype, base line) and being technically relatively straightforward, 
can be a very powerful tool. It allows, for example, to parameterize t(9) by 
introducing hyperparameters (see Section ||) and to specify explicitly different 



maxima of multimodal functional priors (see Section |]. [131, |132j , |133| , 134 , 



135| ). All this cannot be done by referring to a single baseline. 
Hence, in this section we consider error terms of the form 

l(0-t,K(0-t)). (224) 
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Mean or template functions t allow an easy and straightforward implementa- 
tion of prior information in form of examples for 0. They are the continuous 
analogue of standard training data. The fact that template functions t are 
most times chosen equal to zero, and thus do not appear explicitly in the 
error functional, should not obscure the fact that they are of key importance 
for any generalization. There are many situations where it can be very valu- 
able to include non-zero prior means explicitly. Template functions for can 
for example result from learning done in the past for the same or for similar 
tasks. In particular, consider for example <f>(x) to be the output of an empiri- 
cal learning system (neural net, decision tree, nearest neighbor methods, . . .) 
being the result of learning the same or a similar task. Such a 0(x) would be 
a natural candidate for a template function t(x). Thus, we see that template 
functions could be used for example to allow transfer of knowledge between 
similar tasks or to include the results of earlier learning on the same task in 
case the original data are lost but the output of another learning system is 
still available. 

Including non-zero template functions generalizes functional E& of Eq. 



fllgq) to 



= -(lnP(0), AO + i(0-t,K(0-t)) + (P(0),A x ) (225) 

= -(lnP(0), N) + ^(0, K0) - (J, 0) + (P(0), A x )+const.(226) 

In the language of physics J = Kt represents an external field coupling to 
4>(x,y), similar, for example, to a magnetic field. A non-zero field leads to a 
non-zero expectation of in the no-data case. The 0-independent constant 
stands for the term Kt), or |(J, K -1 J) for invertible K, and can be 
skipped from the error/energy functional E§. 

The stationarity equation for an with non-zero template t contains 
an inhomogeneous term K£ = J 

= P'(0)P- 1 (0)AA - P'(0)A x - K (0 - t) , (227) 

with, for invertible PP' 1 and Ax ^ 0, 

Ax = Ix(N- PP'-'K (0 - tj) . (228) 

Notice that functional (|225| ) can be rewritten as a functional with zero tem- 
plate t = in terms of = <f>—t. That is the reason why we have not included 
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non-zero templates in the previous sections. For general non-additive com- 
binations of squared distances of the form ( |224|) non-zero templates cannot 
be removed from the functional as we will see in Section |6|. Additive combi- 
nations of squared error terms, on the other hand, can again be written as 
one squared error term, using a generalized 'bias-variance'-decomposition 

i JV i 

- - tj, K, (0 - *,-)) =-U-t,K(<f>-t))+ E min (229) 
with template average 



N 



t = K~ l J2 K jtj, (230) 

3=1 

assuming the existence of the inverse of the operator 

N 

K = E K r (231) 

3=1 

and minimal energy /error 

N 1 N 

E min = -V(t u ■■■t N ) = - Y,{tj, Kj tj) - (t, Kt), (232) 

3=1 

which up to a factor N/2 represents a generalized template variance V. We 
end with the remark that adding error terms corresponds in its probabilistic 
Bayesian interpretation to ANDing independent events. For example, if we 
wish to implement that is likely to be smooth AND mirror symmetric, we 
may add two squared error terms, one related to smoothness and another to 
mirror symmetry. According to ( |229| ) the result will be a single squared error 
term of form 



Summarizing, we have seen that there are many potentially useful ap- 
plications of non-zero template functions. Technically, however, non-zero 
template functions can be removed from the formalism by a simple substitu- 
tion 0' = — t if the error functional consists of an additive combination of 
quadratic prior terms. As most regularized error functionals used in practice 
have additive prior terms this is probably the reason that they are formulated 
for t = 0, meaning that non-zero templates functions (base lines) have to be 
treated by including a preprocessing step switching from to 0'. We will see 
in Section ^| that for general error functionals templates cannot be removed 
by a simple substitution and do enter the error functionals explicitly. 
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3.6 Quadratic density estimation and empirical risk 
minimization 



Interpreting an energy or error functional E probabilistically, i.e., assuming 
— /3E + c to be the logarithm of a posterior probability under study, the 
form of the training data term has to be — X^m-Pi- Technically, however, it 
would be easier to replace that data term by one which is quadratic in the 
probability P of interest. 

Indeed, we have mentioned in Section |2.5| that such functionals can be 
justified within the framework of empirical risk minimization. From that 
Frequentist point of view an error functional E(P), is not derived from a 
log-posterior, but represents an empirical risk r(P, f) = J2iK x i^Vi^ P)i a P~ 
proximating an expected risk r(P, f) for action a = P. This is possible 
under the assumption that training data are sampled according to the true 
p(x,y\f). In that interpretation one is therefore not restricted to a log-loss 
for training data but may as well choose for training data a quadratic loss 
like 

\{P~ Pemp, Kfl (P - Pemp)) , (233) 

choosing a reference density Pemp and a real symmetric positive (semi-)/- 
definite Kb. 

Approximating a joint probability p(x, y\h) the reference density P em p 
would have to be the joint empirical density 

1 n 

PZf(x, y) = - £ 6{z - Xi)S{y - W ), (234) 

i 

i.e., P^p* = N/n, as obtained from the training data. Approximating con- 
ditional probabilities p(y\x, h) the reference P em p has to be chosen as condi- 
tional empirical density, 

P^(*,V) = **'-*M-*) = !!MI I (235) 
T,iS{x-Xi) n x 

or, defining the diagonal matrix Nx(x, x', y, y') = 8(x — x')5(y — y')N x (x) = 
5(x - x')5(y - y') Ei S(x - x { ) 

Pemp = N^iV. (236) 

This, however, is only a valid expression if Nx{x) ^ 0, meaning that for all 
x at least one measured value has to be available. For x variables with a 
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large number of possible values, this cannot be assumed. For continuous x 
variables it is even impossible. 

Hence, approximating conditional empirical densities either non-data x- 
values must be excluded from the integration in ( |233| ) by using an operator 



Kfl containing the projector J2x'ex D 5(x—x'), or P cmp must be defined also for 
such non-data a;- values. For existing Vx — 1^1 — fdy 1, a possible extension 
-Pemp of -Pemp would be to assume a uniform density for non-data x values, 
yielding 

r, \ — for s-j; htO, 

for : (237) 

This introduces a bias towards uniform probabilities, but has the advantage 
to give a empirical density for all x and to fulfill the conditional normalization 
requirements. 

Instead of a quadratic term in P, one might consider a quadratic term in 
the log-probability L. The log-probability, however, is minus infinity at all 
non-data points (x, y) ^ D. To work with a finite expression, one can choose 
small e(y) and approximate P emp by 



g( y) + T,jS(x - Xj)6(y - yi) 
Jdye(y) + J2iS(x-Xi 



provided Jdye(y) exists. For e(y) ^ also P^ mp (x,y) ^ 0, Va; and L € emp = 
lnP e e mp > — oo exists. 

A quadratic data term in P results in an error functional 

E P = i(P - Pemp, K D (P - P cmp )) + i(P, KP) + (P, A x ), (239) 



skipping the constant part of the Ax-terms. In (|239|) the empirical density 



P emp may be replaced by P em p of (|237| ) 



Positive (semi-) definite operators have a square root and can be 
written in the form R T R. One possibility, skipping for the sake of simplicity 
x in the following, is to choose as square root R the integration operator, i.e., 
R = ® fe Rfc and R(y, y') = 0(y — y'). Thus, (p = RP transforms the density 
function P in the distribution function 0, and we have P = P(</>) = R~V- 
Here the inverse R -1 is the differentiation operator Yik ^y k (with appropriate 
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boundary condition) and (R T ) R 1 = — Uk ^-k is the product of one- 
dimensional Laplacians = d 2 /dy\. Adding for example a regularizing 
term QTBg ) f (P, P) gives 

~ ( P - P cmp , R T R (P - P cmp ) ) + \ (P, P) (240) 



E, 



~ ^(0-0emp, 0-0emp) ~ \{<f>, ]]_ A k (f)^j 



(241) 



JjAfc +m 2 l)(j)) - (0,0emp) + ^(0emp, 0emp)- (242) 



,2 _ \-l 



with m = A . Here the empirical distribution function e mp = R-Pemp is 
given by <fr cmp (y) = ~ % ~ Hi) (or, including the x variable, (j) emp {x, y) = 
N \ x \ Y,x'£x D $( x ~ x ')Q(y ~ Hi) f° r Nx(x) 7^ which could be extended to a 
linear <p = RP em p for Nx(x) = 0). The stationarity equation yields 



m 



2 



~l[A k + m 2 l) emp . (243) 



For d y — 1 (or = Yik^) the operator becomes (— A + m 2 I) 1 which has 
the structure of a free massive propagator for a scalar field with mass m 2 
and is calculated below. As already mentioned the normalization and non- 
negativity condition for P appear for </> as boundary and monotonicity con- 
ditions. For non-constant P the monotonicity condition has not to be im- 
plemented by Lagrange multipliers as the gradient at the stationary point 
has no components pointing into the forbidden area. (But the conditions 
nevertheless have to be checked.) Kernel methods of density estimation, like 
the use of Parzen windows, can be founded on such quadratic regularization 



functionals [219]. Indeed, the one-dimensional Eq. ([243|) is equivalent to the 



use of Parzens kernel in density estimation [ |179| , |16(fl . 



3.7 Regression 

3.7.1 Gaussian regression 

An important special case of density estimation leading to quadratic data 
terms is regression for independent training data with Gaussian likelihoods 

p( yi \xi,h) = -=^e & , (244) 

V Z7TCX 
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with fixed, but possibly ^-dependent, variance a 2 . In that case P(x,y) = 
p(yi\xi,h) is specified by <fi = h and the logarithmic term Si m -Pj becomes 
quadratic in the regression function h(xi), i.e., of the form ( |224j ). In an inter- 
pretation as empirical risk minimization quadratic error terms corresponds 
to the choice of a squared error loss function l(x,y,a) = (y — a(x)) 2 for ac- 
tion a(x). Similarly, the technical analogon of Bayesian priors are additional 
(regularizing) cost terms. 

We have remarked in Section |2.3| that for continuous x measurement of 
h(x) has to be understood as measurement of a h(x) = Jdx , d(x)h(x) for 
sharply peaked $(x). We assume here that the discretization of h used in 
numerical calculations takes care of that averaging. Divergent quantities like 
(5-functionals, used here for convenience, will then not be present. 

We now combine Gaussian data terms and a Gaussian (specific) prior 
with prior operator K.q(x,x') and define for training data Xi, yi the operator 

Ki(x,x') = 5(x - Xi)5(x - x'), (245) 

and training data templates t — yi. We also allow a general prior template 
to but remark that it is often chosen identically zero. According to ( |229|) the 
resulting functional can be written in the following forms, useful for different 
purposes, 

Eh = Ht(h(xi)-Vi) 2 + ~(h-ta, K (h-t )) x (246) 
1 i=i 1 

1 n 1 
= o K i {h-t i )) x + -{h-t , K {h-t )) x (247) 

1 8=1 1 

= ±(h- t D , K D (h - t D )) x +^(h - t , K (h - to))x+E^ n (248) 
= l(h-t,K{h-t)) x + E mill , (249) 



with 



K D = J2 Kj, t D = K D X £ Kiti, (250) 



K = ^K,, t = K- 1 ^KA, (251) 

j=0 i=0 

and /z-independent minimal errors, 

E^n = UiL ft. K ^)x + (t D , K D t D ) x ) , (252) 
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£mi„ = \ (jt K ^)x + (t, Kt) x ) , (253) 

being proportional to the "generalized variances" Vd = 2E® in /n and V = 
2E min /(n + 1). The scalar product (-,-)x stands for ^-integration only, for 
the sake of simplicity however, we will skip the subscript X in the following. 
The data operator 

n 

Kd(x, x') = S(x — Xi)S(x — x ) = n x S(x — x'), (254) 
i=i 

contains for discrete x on its diagonal the number of measurements at x, 

n 

n x = N x (x)=J2 S ( x - x i)' ( 255 ) 
i=i 

which is zero for x not in the training data. As already mentioned for con- 
tinuous x a integration around a neighborhood of Xj is required. K^ 1 is a 
short hand notation for the inverse within the space of training data 

Kp 1 = (IdKoIo)- 1 = 5{x - x')/n X} (256) 

Id denoting the projector into the space of training data 

h 

I D = 5{x-x')J2H x - x i)- ( 257 ) 

i=l 

Notice that the sum is not over all n training points Xi but only over the 
n < n different Xj. (Again for continuous x an integration around X; is 
required to ensure I 2 D = I^). Hence, the data template tjj becomes the mean 
of y-values measured at x 

J n x 

= — E v( x j)> ( 258 ) 

and tn{x) = for n x = 0. Normalization of P(x,y) is not influenced by a 
change in h(x) so the Lagrange multiplier terms have been skipped. 
The stationarity equation is most easily obtained from ( |249| ) , 

= K(/i-t). (259) 
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It is linear and has on a space where K 1 exists the unique solution 

h = t. (260) 

We remark that K can be invertible (and usually is so the learning problem 
is well defined) even if Ko is not invertible. The inverse K -1 , necessary to 
calculate t, is training data dependent and represents the covariance opera- 
tor/matrix of a Gaussian posterior process. In many practical cases, however, 
the prior covariance Kq 1 (or in case of a null space a pseudo inverse of Ko) 
is directly given or can be calculated. Then an inversion of a finite dimen- 
sional matrix in data space is sufficient to find the minimum of the energy 
E h [[2231 0]- 

Invertible Ko: Let us assume first deal with the case of an invertible 
K . It is the best to begin the stationarity equation as obtained from fl247|) 
or (|H) 

= f^K^-^+Ko^-to) (261) 

»=i 

= K D (h-t D )+K Q (h-t ). (262) 

h = t + K 1 K D (t D -h), (263) 
a = K D (t D - h), (264) 



For existing Kq 1 
one can introduce 
to obtain 



h = t + Ko l a. (265) 
Inserting Eq. fl265|) into Eq. (|264j) one finds an equation for a 

(l + K D K 1 )a = K D (t D -t ). (266) 



Multiplying Eq. (|266|) from the left by the projector 1^ and using 

K D I D = I D K D , a = l D a, t D = I D t D , (267) 
one obtains an equation in data space 

(l D + K D K~k D ) a = K D (t D - t , D ), (268) 
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where 

Koiz? = (Kq X )dd = IdKq 'Id ± (^o,dd)~\ t 0>D = I D t . (269) 

Thus, 

a = C DD b, (270) 

where 

C DD =(l D + K D K } )D y\ (271) 

and 

b = K D (t D -t ). (272) 
In components Eq. ( |270| ) reads, 

{$ki + n Xk .K.Q 1 (x k , x t fj a(xi) = n Xk {t D (x k ) - t (x k )) . (273) 

i 

Having calculated a the solution h is given by Eq. ( |265| ) 

h = t + K^Cnnb = t Q + K„ 1 (k d 1 + K^,)" 1 (t D - t ). (274) 



Eq. ( p74j) can also be obtained directly from Eq. (|260|) and the definitions 
( |251| ), without introducing the auxiliary variable a, using the decomposition 
K t = -K D t + (K + K D )t and 



K ^ = K 1 I + K D K 1 K D = K 1 J2 "KdK 1 K D (275) 



m=0 



K o X E (-K fl I D K - 1 I D ) m K D = K - 1 (l D + K D K ^ D ) _ K D . (276) 



m=0 



K 1 Cdd is also known as equivalent kernel due to its relation to kernel 
smoothing techniques ||205| , |94| , p0| , [76[ . 



Interestingly, Eq. (|265|) still holds for non-quadratic data terms of the 
form gn{h) with any differentiable function fulfilling g(h) = g(hu), where hp 
= \oh is the restriction of h to data space. Hence, also the function of func- 
tional derivatives with respect to h(x) is restricted to data space, i.e., g'iho) 
— 9d(^d) with g' D = I^g' and g'(h,x) = Sg(h)/Sh(x). For example, g(h) = 
J2i=i V{h(xi) — Ui) with V a differentiable function. The finite dimensional 
vector a is then found by solving a nonlinear equation instead of a linear one 

n mi- 
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Furthermore, one can study vector fields, i.e., the case where, besides 
possibly x, also y, and thus h(x), is a vector for given x. (Considering the 
variable indicating the vector components of y as part of the a;-variable, this 
is a situation where a fixed number of one-dimensional y, corresponding to a 
subspace of X with fixed dimension, is always measured simultaneously.) In 
that case the diagonal Kj of Eq. ( |245| ) can be replaced by a version with non- 
zero off-diagonal elements K a)Q / between the vector components a of y. This 
corresponds to a multi-dimensional Gaussian data generating probability 

p(yi\xi,h) = ^ K f e~ ^ fo-*-M*i)) K,, n y (*0fo,„'- V (*0) ; ( 2 77) 
(27r) 2 

for fc-dimensional vector ?/j with components yj )Q! . 

Non-invertible Ko: For non-invertible Ko one can solve for h using 
the Moore-Penrose inverse Kjf. Let us first recall some basic facts 0, |161 



[T5], A pseudo inverse of (a possibly non-square) A is defined by the 



conditions 

A # AA # = A, AA # A = A # , (278) 
and becomes for real A the unique Moore-Penrose inverse A* if 

(AA # ) T = AA # , (A # A) T = A # A. (279) 

A linear equation 

Ax = b (280) 

is solvable if 

AA*b = b. (281) 

In that case the solution is 



x 



A*b + x = A*b + y-A*Ay, (282) 



where x° = y — A* Ay is solution of the homogeneous equation Ax° = and 
vector y is arbitrary. Hence, x° can be expanded in an orthonormalized basis 
ipi of the null space of A 

E c ^- ( 283 ) 



x 



o 



l 

For an A which can be diagonalized, i.e., A = M _1 DM with diagonal D, 
the Moore-Penrose inverse is A* = M _1 D*M. Therefore 

AA # = A # A = I x = I - I . (284) 
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where I = J2i ^i^J is the projector into the zero space of A and Ii = I — 1 
= M^DD*^! Thus, the solvability condition Eq. (]28lD becomes 

I b = 0, (285) 

or in terms of ipi 

(Vi, 6) =0, VZ, (286) 

meaning that the inhomogeneity b must have no components within the zero 
space of A. 

Now we apply this to Eq. ( [262] ) where K is diagonalizable because pos- 
itive semi definite. (In this case M is an orthogonal matrix and the entries 
of D are real and larger or equal to zero.) Hence, one obtains under the 
condition 

I (K t + K D (t D -h)) = 0, (287) 

for Eq. (HD 

h = K* (K t + K D (t D - h)) + h°, (288) 

where Ko/i° = so that h° = J2i citfi can be expanded in an orthonormalized 
basis ipi of the null space of Ko, assumed here to be of finite dimension. To 
find an equation in data space define the vector 

a = K D (t D - h), (289) 

to get from Eqs.( [287| ) and (|288| ) 

= (Vz, K t ) + (^, a), V/ (290) 

h = K#(K t + «) + E c ^- ( 291 ) 

l 

These equations have to be solved for a and the coefficients q. Inserting Eq. 
( |291| ) into the definition ( |289| ) gives 

(I + K D K*)a = K D t D - Knhto - K D £ crfu (292) 

i 

using K*K = Ii according to Eq. ( p84| ). Using a = Ioa the solvability 
condition ( |287| ) becomes 

n 

$3^,(a; i )a = -(Vi J Koto),V/, (293) 

i=l 
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the sum going over different Xi only. Eq. ( |292| ) for a and q reads in data 
space, similar to Eq. (|268|) , 

a = Cb, (294) 

where C _1 = I + K/jK* has been assumed invertible and b is given by the 
right hand side of Eq. ( |292|) . Inserting into Eq. (|291|) the solution finally can 
be written 

/i = I 1 t + K#C6 + ^c^. (295) 

i 

Again, general non-quadratic data terms g{ho) can be allowed. In that 
case Sg(hD)/Sh(x) = g'iho^x) = (Id(/)(/id, ^) and Eq. ( [289| ) becomes the 
nonlinear equation 



a = g\h D ) = g\\ D (K # (K t + K D (t D - h)) + h )) . (296) 
The solution(s) a of that equation have then to be inserted in Eq. ( |291| ). 



3.7.2 Exact predictive density 

For Gaussian regression the predictive density under training data D and 
prior D can be found analytically without resorting to a saddle point ap- 
proximation. The predictive density is defined as the /i-integral 

P (y\x,D,D ) = J d h P (y\ X ,hMh\D,D ) 

Jdhp(y\x, h)p(y D \x D , ti)p(h\D ) 



Jdhp(y D \x D , h)p(h\D ) 
p(y,y D \x,x D ,D ) 



p(yD\x D ,D D 



(297) 



Denoting training data values y,i by t,i sampled with covariance Kj concen- 
trated on Xi and analogously test data values y = y n+ \ by t n+ i sampled with 
(co-) variance K n+1 , we have for 1 < i < n + 1 



p(y l \x t ,h)=det(K l /2ir)2e 5 V" — "V, (298) 
and 

p(h\D ) = det(K /2n)h^( h ~ to ' Koih ~ to) ) , (299) 
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hence 

r dh "J T,7=0 {h-tuKiih-tS) +| Etc lndet I (K l /2^) 

p(y\x, D, D Q ) = i , . (300) 

J dh ~\ Elo (h-tuKtih-ti) J +| ^to lnd e t ! (K l /2 7 r) 

Here we have this time written explicitly detj(Kj/27r) for a determinant calcu- 
lated in that space where Kj is invertible. This is useful because for example 
in general det« Kj det K 7^ detj KjK . Using the generalized 'bias-variance'- 
decomposition Q229| ) yields 

fdh -3 (*-*+> K +( h -^))+f v ++IE^ llndot i( K i/ aff ) 
p(y\x, D, D ) = J — , , (301) 



Jdhe 



with 



a 



t = K-^KA, K = ^K,, (302) 

i=0 i=0 
n+1 n+1 

t+ = K; 1 ^^, K + = ^K,, (303) 

i=0 i=0 
n+1 



1 n+1 K 

-E(^K A )-(t + , -±t + ). (305) 



Now the /i-integration can be performed 

e -§ + i ^ri 1 lndct l (K l /2 7 r)-i lndct(K+/27r) 
P(y|x, -D, Do) = e -^ + lELo^^(K I /2,)-llnd Ct (K/2,) ( 306 ) 

Canceling common factors, writing again y for t n+ %, K. x for K n+1 , det^ for 
det n+ i, and using K + t + = Kt + K. x y, this becomes 

p(y\x,D,D ) = e -^y^yy)+(y,K y t)+l(t,(KK- 1 K-K)t)+^lndct x (K x K- 1 K/2n)_ 

(307) 

Here we introduced K. y = = K x . — K^K^Kj; and used that 

det K- X K + = det(I - K^K^) = de^Kr 1 !^ (308) 
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can be calculated in the space of test data x. This follows from K = K + — K z 
and the equality 

det ^ 1 ~ B A \y det(l - A) (309) 

with A = I X K.~ 1 ~K X , B = (I — I 9 .)K _1 K a ., and I x denoting the projector into 
the space of test data x. Finally 

K, = K, - K.K; 1 ^ = K^K^K = (K - KK^K), (310) 

yields the correct normalization of the predictive density 

P (y\x, D,D ) = e -h{y-y^y(y-y))H^(Ky/^)^ (311) 
with mean and covariance 

y = t = K- 1 ^K l t i , (312) 

i=0 

K^ 1 = (K.-K.K; 1 ^)" 1 =K; 1 + I x .K- 1 I x .. (313) 

It is useful to express the posterior covariance K 1 by the prior covariance 
Kq . According to 

with A = K d Kq}j D , B = K D K~* DB , and K^ DD = l D Ko l l D , K~j, s = 
IdKq 1 !^, I d = I - I D we find 

K 1 = Kq^I + KdKo 1 )" 1 (315) 
= K 1 (jl D + KdJS^o)" 1 - (l D + KflKsJJ" 1 K D K^ DD + I 5 ) . 

Notice that while K^ 1 = (I^K^I^) -1 in general K ^ D = Id Kg 1 Id ^ 
(IdKqId) -1 . This means for example that Kq 1 has to be known to find 
Kq£)£> and it is not enough to invert IdK Id = K 0) dd 7^ (Kqijd) -1 - m 

data space (l D + K D K = (K^, 1 + K^ D ) K^, 1 , so Eq. ( [3151) can 
be manipulated to give 

K- 1 = K 1 (i - I D (K, 1 + K-^) 1 I D K- ^ . (316) 
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This allows now to express the predictive mean ( |312| ) and covariance 
by the prior covariance 



y = t + K" 1 (K^ 1 + K ^) (t D -t ), (317) 
K; 1 = K. + K^-K^K^+K^)" 1 ^. (318) 



Thus, for given prior covariance K 1 both, y and 1 , can be calculated by 

inverting the n x n matrix K = ^Kq£ D + 

Comparison of Eqs . ( |3 1 7] , |3 1 8| ) with the maximum posterior solution h* of 
Eq. ( j274j ) now shows that for Gaussian regression the exact predictive density 
p(y\x, D, Do) and its maximum posterior approximation p(y\x, h*) have the 
same mean 

t= dyyp(y\x,D,D ) = dyyp{y\x,h*). (319) 



The variances, however, differ by the term I X K _1 I X .. 

According to the results of Section pi. 2. 2| the mean of the predictive density 
is the optimal choice under squared-error loss fl5T|). For Gaussian regression, 
therefore the optimal regression function a* (x) is the same for squared-error 
loss in exact and in maximum posterior treatment and thus also for log-loss 
(for Gaussian p(y\x, a) with fixed variance) 

a MPA,log = a exact,log = a MPA,sq. = fl exact,sq. = ^ = ^- (320) 

In case the space of possible p(y\x,a) is not restricted to Gaussian densi- 
ties with fixed variance, the variance of the optimal density under log-loss 
p(y\x, cCxactjog) = v(y\ x i D, D ) differs by I^K -1 ^ from its maximum poste- 
rior approximation p(y\x, a^pA^g) = P(y\ x , h*). 

3.7.3 Gaussian mixture regression (cluster regression) 

Generalizing Gaussian regression the likelihoods may be modeled by a mix- 
ture of m Gaussians 

p(y\x, h) = U P[k)€ g , (321) 

JdyETp(k)e-^- h ^ 2 

where the normalization factor is found as J2kP(k) (■£-) 2 ■ Hence, h is here 
specified by mixing coefficients p(k) and a vector of regression functions hk(x) 
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specifying the x-dependent location of the kth cluster centroid of the mixture 
model. A simple prior for h k {x) is a smoothness prior diagonal in the cluster 
components. As any density p(y\x,h) can be approximated arbitrarily well 
by a mixture with large enough m such cluster regression models allows to 
interpolate between Gaussian regression and more flexible density estimation. 
The posterior density becomes for independent data 

pWd.d,^ ASK^ (322) 



P(VD\X D ,D ) ; E „, pW (0 



Maximizing that posterior is — for fixed x, uniform p(k) and p(h\Do) - 
equivalent to the clustering approach of Rose, Gurewitz, and Fox for squared 
distance costs [|199 . 



3.7.4 Support vector machines and regression 

Expanding the regression function h{x) in a basis of eigenfunctions ^ of K 
K = ]T A fc h{x) = J2 n k ^k(x) (323) 

k k 



yields for functional (246) 



i \ k / k 



(324) 



Under the assumption of output noise for training data the data terms may 
for example be replaced by the logarithm of a mixture of Gaussians. Such 
mixture functions with varying mean can develop flat regions where the error 
is insensitive (robust) to changes of h. Analogously, Gaussians with varying 
mean can be added to obtain errors which are flat compared to Gaussians 
for large absolute errors. Similarly to such Gaussian mixtures the mean- 
square error data term (yi — h(xi)) 2 may be replaced by an e-insensitive 
error \yi — h(xi)\ e , which is zero for absolute errors smaller e and linear for 
larger absolute errors (see Fig.||). This results in a quadratic programming 
problem and is equivalent to Vapnik's support vector machine | |22(J| , [74] , [22 1| , 



\209[ |210| , |4£fl . For a more detailed discussion of the relation between support 



vector machines and Gaussian processes see ||224|, |2U3 . 
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u w 

Figure 5: Three robust error functions which are insensitive to small errors. 
Left: Logarithm of mixture with two Gaussians with equal variance and 
different means. Middle: Logarithm of mixture with 11 Gaussians with equal 
variance and different means. Right: e-insensitive error. 




3.8 Classification 

In classification (or pattern recognition) tasks the independent visible vari- 
able y takes discrete values (group, cluster or pattern labels) [[U], |61], ^4], |47 . 
We write y = k and p(y\x, h) = P k (x,h), i.e., J2k Pk(x, h) = 1. Having re- 
ceived classification data D = {(xi,ki)\l < i < n} the density estimation 
error functional for a prior on function <fi (with components <p k and P = 
P ((/))) reads 

n i 

E cl . = J2 lni\(x 4 ; 0) + -U - t, K (0 - t)) + (P (</>), A x ). (325) 

In classification the scalar product corresponds to an integral over x and a 
summation over k, e.g., 

-t, K(0-t)J = / dxdx'((f) k (x) - t k (x))K k ^(x, x')((f) k/ (x') -t k >(x')), 

k,k' 

(326) 

and (P,A X ) = JdxA x (x)J2kPk(x). 

For zero-one loss l(x, k, a) = 8 kja M — a typical loss function for classifi- 
cation problems — the optimal decision (or Bayes classifier) is given by the 
mode of the predictive density (see Section |2.2.2| ), i.e., 

a(x) = &rgmax k p(k\x, D, D ). (327) 

In saddle point approximation p(k\x, D, Dq) ~p(/c|x, </>*) where 0* minimiz- 
ing E c i((j)) can be found by solving the stationarity equation ( |227|) . 

For the choice (f) k = P k non-negativity and normalization must be en- 
sured. For cf) = L with P = e L non-negativity is automatically fulfilled but 
the Lagrange multiplier must be included to ensure normalization. 
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likelihood p(y\x, h) 


problem type 


of general form 


density estimation 


discrete y 


classification 


Gaussian with fixed variance 


regression 


mixture of Gaussians 


clustering 


quantum mechanical likelihood 


inverse quantum mechanics 



Table 3: Special cases of density estimation 



Normalization is guaranteed by using unnormalized probabilities <pk = Zk, 
P — z k/ (for which non-negativity has to be checked) or shifted log- 
likelihoods (pk = 9k with gk = Lk+\nJ2i e Li , i.e., Pk = e 9k / J2i e 9i . In that case 
the nonlocal normalization terms are part of the likelihood and no Lagrange 
multiplier has to be used ||231|| . The resulting equation can be solved in the 
space defined by the X-data (see Eq. ( |153j )). The restriction of cpk = gk to 
linear functions (pk (x) = WkX + bk yields log-linear models ||152|| . Recently 



a mean field theory for Gaussian Process classification has been developed 

EH Eli- 
Table |3] lists some special cases of density estimation. The last line of the 
table, referring to inverse quantum mechanics, will be discussed in the next 
section. 



3.9 Inverse quantum mechanics 

Up to now we have formulated the learning problem in terms of a function 
having a simple, e.g., pointwise, relation to P. Nonlocalities in the relation 
between and P was only due to the normalization condition, or, working 
with the distribution function, due to an integration. Inverse problems for 
quantum mechanical systems provide examples of more complicated, nonlocal 
relations between likelihoods p(y\x, h) = p(y\x, <p) and the hidden variables (j> 
the theory is formulated in. To show the flexibility of Bayesian Field Theory 
we will give in the following a short introduction to its application to inverse 
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quantum mechanics. A more detailed discussion of inverse quantum problems 
including numerical applications can be found in ||132| , |142| , |141| > [L37], |217| . 



The state of a quantum mechanical systems can be completely described 
by giving its density operator p. The density operator of a specific system 
depends on its preparation and its Hamiltonian, governing the time evolution 
of the system. The inverse problem of quantum mechanics consists in the 
reconstruction of p from observational data. Typically, one studies systems 
with identical preparation but differing Hamiltonians. Consider for example 
Hamiltonians of the form H = T + V, consisting of a kinetic energy part 
T and a potential V. Assuming the kinetic energy to be fixed, the inverse 
problem is that of reconstructing the potential V from measurements. A 
local potential ~V(y, y') = V(y)S(y — y') is specified by a function V(y). Thus, 
for reconstructing a local potential it is the function V(y) which determines 
the likelihood p(y\x,h) = p(y|X,p) = p(y\~K, V) = P(4>) and it is natural 
to formulate the prior in terms of the function <fi = V . The possibilities of 
implementing prior information for V are similar to those we discuss in this 
paper for general density estimation problems. It is the likelihood model 
where inverse quantum mechanics differs from general density estimation. 

Measuring quantum systems the variable x corresponds to a hermitian 
operator X. The possible outcomes y of measurements are given by the 
eigenvalues of X, i.e., 

X|y >= y\y >, (328) 

where \y >, with dual < y\, denotes the eigenfunction with eigenvalue y. (For 
the sake of simplicity we assume nondegenerate eigenvalues, the generaliza- 
tion to the degenerate case being straightforward.) Defining the projector 

Il^ y = \y><y\ (329) 

the likelihood model of quantum mechanics is given by 

p(y\x,p) = Ti(Tl Xiy p). (330) 

In the simplest case, where the system is in a pure state, say the ground 
state y?o of H fulfilling 

H\<p >= E \ip >, (331) 

the density operator is 

p = p 2 = |y2 ><v?o|, (332) 
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P 


general pure state 


|V >< VI 


stationary pure state 


|^(H)x^(H)| 


ground state 


MH)| ><^o(H)| 


time-dependent pure state 


|U(MoMto) ><U(Mo)^o)| 


scattering 


lim t^oo |U(t,toM*o) >< U{t,t )i){t )\ 

tn — ' — oo 


general mixture state 


EkP(k) \ipk >< 4>k\ 


stationary mixture state 


E,p(*|H) |^(H)x^(H)| 


canonical ensemble 


(Tre-' 3H )- 1 e-^ H 



Table 4: The most common examples of density operators for quantum 
systems. In this table V denotes an arbitrary pure state, fi represents an 
eigenstate of Hamiltonian H. The unitary time evolution operator for a 
time-independent Hamiltonian H is given by U = e _ *(*~*°) H . In scattering 
one imposes typically additional specific boundary conditions on the initial 
and final states. 

and the likelihood ( |33U| ) becomes 

p(y\x,h)=p(y\X,p)=Tr(\<p >«p \y><y\) = \Mv)\ 2 - (333) 

Other common choices for p are shown in Table [|. 

In contrast to ideal measurements on classical systems, quantum mea- 
surements change the state of the system. Thus, in case one is interested 
in repeated measurements for the same p, that density operator has to be 
prepared before each measurement. For a stationary state at finite tempera- 
ture, for example, this can be achieved by waiting until the system is again 
in thermal equilibrium. 

For a Maximum A Posteriori Approximation the functional derivative of 
the likelihood is needed. Thus, for reconstructing a local potential we have 
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to calculate 

5 v{y) p{ yi \X,V). (334) 

To be specific, let us assume we measure particle coordinates, meaning we 
have chosen X to be the coordinate operator. For a system prepared in the 
ground state of its Hamiltonian H, we then have to find, 

Sv(y)\¥o(yi)\ 2 - (335) 

For that purpose, we take the functional derivative of Eq. ( |33 1| ) , which yields 

(H - E ) \8 V (y)<p >= (5 V (y)H - 6 V (j,)E )\(p > . (336) 

Projecting from the left by <y?o| ; using again Eq. ( |331D and the fact that for 
a local potential 5v( y )tt(y', y") = S(y — y')S(y' — y"), shows that 

5 v(y )E =<^ |5y(j / )H|v9 >= \Vo(y)\ 2 - (337) 

Choosing <ip \Sv( y )fo> = and inserting a complete basis of eigenfunctions 
\<fj > of H, we end up with 

<W)¥>o(jk) = E j-i 1 j-i <Pj(yi)<p*j(y)My)- (338) 

From this the functional derivative of the quantum mechanical log-likelihood 
( |335|) corresponding to data point yi can be obtained easily, 

&v(y) hip(yi\X., V) = 2Re ((po(y i )~ 1 6 v{y) ipo(y l )^ . (339) 

The MAP equations for inverse quantum mechanics are obtained by includ- 
ing the functional derivatives of the prior term for V . In particular, for a 
Gaussian prior with mean Vq and inverse covariance Ky, acting in the space 
of potential functions V(y), its negative logarithm, i.e., its prior error func- 
tional, reads 

^(V-V , K v (V-V ))+\nZ v , (340) 

with Zy being the ^-independent constant normalizing the prior over V. 
Collecting likelihood and prior terms, the stationarity equation finally be- 
comes 

= E 5 V(y) Mj/ilX, V) - K v (V - Vq). (341) 

i 
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The Bayesian approach to inverse quantum problems is quite flexible and 
can be used for many different learning scenarios and quantum systems. By 
adapting Eq. ( |339| ), it can deal with measurements of different observables, 
for example, coordinates, momenta, energies, and with other density oper- 
ators, describing, for example, time-dependent states or systems at finite 
temperature [|142|| . 



The treatment of bound state or scattering problems for quantum many- 
body systems requires additional approximations. Common are, for example, 
mean field methods, for bound state problems |55|, |193| , |27 | as well as for 
scattering theory f7|, |27|, [139], [H0|, [139], [130], gT|] Referring to such mean 



field methods inverse quantum problems can also be treated for many-body 
systems [|141| . 



4 Parameterizing likelihoods: Variational 
methods 

4.1 General parameterizations 

Approximate solutions of the error minimization problem are obtained by 
restricting the search (trial) space for h(x, y) = 0(x, y) (or h(x) in regression). 
Functions which are in the considered search space are called trial functions. 
Solving a minimization problem in some restricted trial space is also called a 
variational approach |)7|, |106| , |29| , j36[ , p7| . Clearly, minimal values obtained 



by minimization within a trial space can only be larger or equal than the true 
minimal value, and from two variational approximations that with smaller 
error is the better one. 

Alternatively, using parameterized functions <fi can also implement the 
prior where is known to have that specific parameterized form. (In cases 
where <fi is only known to be approximately of a specific parameterized form, 
this should ideally be implemented using a prior with a parameterized tem- 
plate and the parameters be treated as hyperparameters as in Section ^.) 
The following discussion holds for both interpretations. 

Any parameterization = 0({^}) together with a range of allowed values 
for the parameter vector £ defines a possible trial space. Hence we consider 
the error functional 

E m = -(lnP(£), N) + K0(O ) + (P(0, Ax), (342) 
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for depending on parameters £ and = p( </>(£) ). In the special case of 
Gaussian regression this reads 

E H0 = ~( -t D ,K D h(0 -t D ) + ^(h(C), K h(0 ). (343) 
Defining the matrix 

*'(/;x,,) = %^ (344) 

the stationarity equation for the functional (|342|) becomes 

= ^'P'P- 1 ^ - $'K0 - $'P'A X . (345) 

Similarly, a parameterized functional £7^ with non-zero template t as in (|225|) 
would give 

= ^'P'P^N- $'K(0-t) - $'P'A X . (346) 
To have a convenient notation when solving for Ax we introduce 

P^ = $'(OP'(0), (347) 

i.e., 

dP(x,y) f ,d(j)(x',y') SP(x,y) 
''a' ••'•//? = ~ 

and 

to obtain for Eq. (|345|) 



? #6 J o£t 6(f)(x',y') 

G m = P'^P-'N - <4>'K0, (349) 



P'^Ax = G m . (350) 

For a parameterization £ restricting the space of possible P the matrix P^ is 
not square and cannot be inverted. Thus, let (P^)* be the Moore-Penrose 
inverse of P£, i.e., 

(Pe) # Pe(Pe) # = K P^)^ = (P' ? ) # , (351) 

and symmetric (P^)*P^ and P^(P^)*. A solution for Ax exists if 

P£(Pe) # GW) = G m . (352) 
In that case the solution can be written 

Ax = (P't)*G m + V A - (P^) # P^ A , (353) 
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with arbitrary vector V\ and 

A° X = V A - (P^) # P^ A (354) 
from the right null space of P^, representing a solution of 

P'^A° X = 0. (355) 



Inserting for Ax(x) ^ Eq. ( [353| ) into the normalization condition Ax = 
IxPAx gives 

Ax = IxP ((Pd*Gm) + " (P' 5 ) # P^a) • (356) 

Substituting back in Eq. ( |345|) Ax is eliminated yielding as stationarity equa- 
tion 

= (I - P' e IxP(P^) # ) G m - P^IxP (V A - (P^P'^a) , (357) 



where has to fulfill Eq. ( |352|) . Eq. ( |357| ) may be written in a form 

similar to Eq. (|192|) 



K m (£)=T m (358) 

with 

T^ ) (0=P^P" 1 iV-P , 5 Ax, (359) 

but with 

K^)(0 = $'K$(0, (360) 
being in general a nonlinear operator. 

4.2 Gaussian priors for parameters 

Up to now we assumed the prior to be given for a function 4>{£,){x,y) de- 
pending on x and y. Instead of a prior in a function </>(£) (x, y) also a prior in 
another not (x,y) -dependent function of the parameters ^(0 can be given. 
A Gaussian prior in ip(£) = being a linear function of £, results in a 
prior which is also Gaussian in the parameters £, giving a regularization term 

i(£ W^W^)= l -U, KfO, (361) 
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where = WlK-^W^ is not an operator in a space of functions (f>(x, y) but 
a matrix in the space of parameters £. The results of Section [O] apply to 
this case provided the following replacement is made 

$'K0 -> K c< £. (362) 

Similarly, a nonlinear if) requires the replacement 

$'K0 -> #'K^, (363) 

where 

*'(M) = ^- (364) 

Thus, in the general case where a Gaussian (specific) prior in 0(0 and ^/>(0 
is given, 

= -(lnP(£), AO + (P(£),Ax) 

+i( 0(0, K 0(0 ) + I( ^(0, ^(0 ), (365) 

or, including also non-zero template functions (means) t, for and if) as 
discussed in Section |3l 



fl^ (0 = -(lnP(0,iV) + (P(0,A x ) 

+i( 0(0 -f,K (0(0-0) 

+^(V'(0-^>K V) ^(0-^))- (366) 

The and -0-terms of the energy can be interpreted as corresponding to 
a probability p(£\t, K, K^), p(£|t,K) p(£|t^, K^)), or, for example, 
to p(t^|0K^) K) with one of the two terms term corresponding to a 

Gaussian likelihood with ^-independent normalization. 
The stationarity equation becomes 

= P^p- 1 iV-$ / K(0-t)-^ , K^(V-^)-P^Ax (367) 
= G^-PjAx, (368) 

which defines G^, and for Ax ^ 

Ax = I*P ((P*) # G^ + A x ) , (369) 



for P^A X = 0. 
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Variable 


Error 


Stationarity equation 


Ax 


L(x,y) 


E L 


KL = N - e L A x 


I x (N - KL) 


P(x,y) 


E P 


KP = P^N-Ax 


Ix(N-PKP) 


<P = Vp 


E W 


= 2$~ 1 iV-2$A x 


I x (iV-i$K0) 


<j)(x,y) 


E<f, 


K(f) = P'P- 1 ^ - P'A X 


I x (N- PP /_1 K0) 






$'K0 = P^P- 1 ^ - P^A X 


IxP((P^) # G^ )+ A x ) 






$'K(0-t)+*'K^(V-^) 
= P^P- 1 ^ - P'^Ax 


IxP ((P't)*G M + A x ) 



Table 5: Summary of stationarity equations. For notations, conditions and 



3.1.1, 3.2.1 


. 1.3.2 


• 


3.3.1 




4.1 


and 


4.2 



4.3 Linear trial spaces 

Solving a density estimation problem numerically, the function has to be 
discretized. This is done by expanding in a basis B\ (not necessarily 
orthonormal) and, choosing some / max , truncating the sum to terms with 

<P = Y,ciB l ^<j ) =Y J CiB l . (370) 
i=i i=i 

This, also called Ritz's method, corresponds to a finite linear trial space and is 
equivalent to solving a projected stationarity equation. Using a discretization 
(|370|) the functional (|186|) becomes 

£ Ritz = -(lnP(0), N) + W°M E k, KB l ) + (P(0), Ax). (371) 

1 ki 

Solving for the coefficients cj, I < / max to minimize the error results according 
to Eq. [|34"5l ) and 

&(l;x,y)=B l (x,y), (372) 

in 



= ( B h PP 1 N)-J2ck(B h KB k )-( B h P' A x ), V/ < Z max , (373) 

k 
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corresponding to the Z max -dimensional equation 

K B c = N B (c)-A B (c), (374) 



with 



c(l) = ci, (375) 

K B (l,k) = (B h KB k ), (376) 

N B (c)(l) = (BuF'i^P-^c^N), (377) 

A fl (c)(0 = {B h P'(0(c))A x ). (378) 



Thus, for an orthonormal basis B[ Eq. ( |374| ) corresponds to Eq. ( |188 ) pro- 
jected into the trial space by J]/ -£>; T £>2- 

The so called linear models are obtained by the (very restrictive) choice 

i 

</>(*) =E c ^ = c o + E c ^ (379) 

1=0 i 

with z = (x, y) and B = 1 and Bi — zi. Interactions, i.e., terms proportional 
to products of z-components like c mn z m z n can be included. Including all pos- 
sible interaction would correspond to a multidimensional Taylor expansion 
of the function 0(z). 

If the functions Bi(z) are also parameterized this leads to mixture models 
for 0. (See Section |4~4|.) 

4.4 Mixture models 

The function <p(z) can be approximated by a mixture model, i.e., by a linear 
combination of components functions 

0(*)=5>fl,(&,«), (380) 

with parameter vectors £j and constants q (which could also be included 
into the vector £/) to be adapted. The functions Bi(£i, z) are often chosen to 
depend on one-dimensional combinations of the vectors £j and z. For example 
they may depend on some distance \\^i — z\\ ('local or distance approaches') 
or the projection of z in ^-direction, i.e., J2k£i,k z k ('projection approaches'). 
(For projection approaches see also Sections fO|, fO| and |4.9|) . 
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A typical example are Radial Basis Functions (RBF) using Gaussian 
Bi(£i, z) for which centers (and possibly covariances and also number of com- 
ponents) can be adjusted. Other local methods include fc-nearest neighbors 
methods (fcNN) and learning vector quantizations (LVQ) and its variants. 
(For a comparison see ||155|| .) 



4.5 Additive models 



Trial functions may be chosen as sum of simpler functions (pi each depending 
only on part of the x and y variables. More precisely, we consider functions 
4>i depending on projections z\ = I^z of the vector z = (x,y) of all x and 
y components. l[ denotes an projector in the vector space of z (and not in 
the space of functions Q(x,y)). Hence, (f) becomes of the form 



0(*) = E^(zi)> ( 381 ) 
i 

so only one-dimensional functions <pi have to be determined. Restricting 
the functions <pi to a parameterized function space yields a "parameterized 
additive model" 

= (382) 
i 

which has to be solved for the parameters £. The model can also be gener- 
alized to a model "additive in parameters 

0(*)=5>,(6,*,J/), (383) 
i 

where the functions </>/(£;, x, y) are not restricted to one-dimensional functions 
depending only on projections zi on the coordinate axes. If the parameters £z 
determine the component functions <pi completely, this yields just the mixture 



models of Section [4.4| . Another example is projection pursuit, discussed in 
Section |4.8| ), where a parameter vector £j corresponds to a projections £/ • z. 
In that case even for given still a one-dimensional function <fii(£i ■ z) has 
to be determined. 



An ansatz like Q381 ) is made more flexible by including also interactions 



4>(x, y) = 4>i(zi) + Y 4>ki(zk, zi) + E 4>kim(zk, zi, z m ) H . (384) 



kl klm 
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The functions 4>ki-( z k, z h ' ' ') can De chosen to depend on product terms like 
Zi,iZk,ji or zi,iZk,j z m,m where zij denotes one-dimensional sub-variables of Z[. 
In additive models in the narrower sense ||213| , |92| , |93| , zi is a subset of 



x, y components, i.e., z\ C < i < d x } U < j < rf x denoting 

the dimension of x, <i y the dimension of y. In regression, for example, one 
takes usually the one-element subsets Z\ = {xi} for 1 < I < d x . 

In more general schemes the projections of z do not have to be restricted 
to projections on the coordinates axes. In particular, the projections can be 
optimized too. For example, one-dimensional projections ij f z = w ■ z with 
z,wEXxY (where • denotes a scalar product in the space of z variables) are 
used by ridge approximation schemes. They include for regression problems 
one-layer (and similarly multilayer) feedforward neural networks (see Section 
PD projection pursuit regression (see Section [18]) and hinge functions |31 



For a detailed discussion of the regression case see |75 



The stationarity equation for becomes for the ansatz ( |381| ) 

= VjP- x N - K0 - PjAx, (385) 

with 

0(pl{Zi) 

Considering a density P being also decomposed into components Pi deter- 
mined by the components 4>i 

P(z)=J2Pi(Mzi)), (387) 
i 



the derivative (13861) becomes 



0<Pl{Zi) 

so that specifying an additive prior 

K w (0i-ti))> (389) 

1 ki 

the stationary conditions are coupled equations for the component functions 
<pi which, because P is diagonal, only contain integrations over z^-variables 

= j^P-'N - J2 K ifc (0 fc - t k ) - °-±A x . (390) 
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For the parameterized approach ( |382| ) one finds 

o = ^p'p^n - $;k0 - $;p;ax, (391) 

with 

d(j>i{zi) 

®i{k,zi)= r, ■ (392) 
For the ansatz ( |383| ) $;(&, z) would be restricted to a subset of 

4.6 Product ansatz 

A product ansatz has the form 

<P(z) =n<M^), (393) 

where z\ = Ipz represents projections of the vector z consisting of all x 
and y components. The ansatz can be made more flexible by using sum of 
products 

^)=EIIM4 (394) 

k I 

The restriction of the trial space to product functions corresponds to the 
Hartree approximation in physics. (In a Hartree-Fock approximation the 
product functions are antisymmetrized under coordinate exchange.) 

For additive K = J2i with acting only on (pi, i.e., K; = <g> 
(^[t^li'^j, with I; the projector into the space of functions (pi = li4>i, the 
quadratic regularization term becomes, assuming 1/ I// = 8^, 

( 0, K <f> ) = (f> h K, ft ) J] ( <t>i', <f>i> )• (395) 
i i'^i 

For K = ® l with a product structure with respect to (pi 

(0, K0) = ;Q(0hK^). (396) 

In both cases the prior term factorizes into lower dimensional contributions. 
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4.7 Decision trees 



Decision trees |32| implement functions which are piecewise constant on rect- 
angular areas parallel to the coordinate axes z\. Such an approach can be 
written in tree structure with nodes only performing comparisons of the form 
x < a or x > a which allows a very effective hardware implementation. Such 
a piecewise constant approach can be written in the form 

0(2) = J2 c iT[®( z Hi,k) - a ik) (397) 

l k 

with step function and 2Wj,fc) indicating the component of z which is com- 
pared with the reference value a^. While there are effective constructive 
methods to build trees the use of gradient-based minimization or maximiza- 
tion methods would require, for example, to replace the step function by a 
sigmoid. In particular, decision trees correspond to neural networks at zero 
temperature, where sigmoids become step functions, and which are restricted 



to weights vectors in coordinate directions (see Section |4.9|) . 



An overview over different variants of decision trees together with a com- 



parison with rule-based systems, neural networks (see Section |4.9|) techniques 
from applied statistics like linear discriminants, projection pursuit (see Sec- 
tion |4.8|) and local methods like for example fc-nearest neighbors methods 



(fcNN), Radial Basis Functions (RBF), or learning vector quantization (LVQ) 



is given in 155]. 



4.8 Projection pursuit 

Projection pursuit models |K| |102| , [50| are a generalization of additive models 
(|381|) (and a special case of models ( |383|) additive in parameters) where the 
projections of z — (x, y) are also adapted 

0(*) = 6> + £&(6>,J + &•*)■ (398) 
I 

For such a model one has to determine one-dimensional 'ridge' functions (pi 
together with projections defined by vectors £j and constants £05 Co,;- Adap- 
tive projections may also be used for product approaches 

1 
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Similarly, may be decomposed into functions depending on distances to 
adapted reference points (centers). That gives models of the form 



#*) = IIMIIG-*II)> ( 40 °) 
i 

which require to adapt parameter vectors (centers) and distance functions 
(pi. For high dimensional spaces the number of centers necessary to cover a 
high dimensional space with fixed density grows exponentially Furthermore, 
as the volume of a high dimensional sphere tends to be concentrated near 
its surface, the tails become more important in higher dimensions. Thus, 
typically, projection methods are better suited for high dimensional spaces 



than distance methods 206 



4.9 Neural networks 

While in projection pursuit-like techniques the one-dimensional 'ridge' func- 
tions <pi are adapted optimally, neural networks use ridge functions of a fixed 
sigmoidal form. The resulting lower flexibility following from fixing the ridge 
function is then compensated by iterating this parameterization. This leads 
to multilayer neural networks. 

Multilayer neural networks have been become a popular tool for regres- 



sion and classification problems p0j, P"H PE M pM p26L EM ITM WO 



One-layer neural networks, also known as perceptrons, correspond to the 
parameterization 

4>{z) =<t\Y, wtz t -bj= <t(v), (401) 

with a sigmoidal function a, parameters £ = w, projection v = J2i w i z i ~ b 
and zi single components of the variables x, y, i.e., Z\ = xi for 1 < I < d x 
and z\ = yi for d x + 1 < I < d x + d y . (For neural networks with Lorentzians 
instead of sigmoids see f72fl .) 

Typical choices for the sigmoid are a{y) = tanh(/3f) or cr(v) = 1/(1 + 
e -2pvy ^p^g p arame ter /3, often called inverse temperature, controls the 
sharpness of the step of the sigmoid. In particular, the sigmoid functions 
become a sharp step in the limit (3 — > oo, i.e., at zero temperature. In princi- 
ple the sigmoidal function o may depend on further parameters which then 
- similar to projection pursuit discussed in Section ^]8| — would also have 
to be included in the optimization process. The threshold or bias b can be 
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treated as weight if an additional input component is included clamped to 
the value 1. 

A linear combination of perceptrons 



0(z,y) = & + £W^(X>, fc *fc-&*), (402) 

l V k ) 



has the form of a projection pursuit approach ( p98| ) but with fixed <pi{v) = 
W«t(v). 

In multi-layer networks the parameterization (|401|) is cascaded, 

(rrn-i \ 
J2 w M,iZi,i-i ~ h,i)J = cr(v k ,i), (403) 

with Zk,i representing the output of the fcth node (neuron) in layer i and 

Vk,i = Y w M,iZi,i-i - h,i, (404) 
i=i 

being the input for that node. This yields, skipping the bias terms for sim- 
plicity 

(mn-l / "in-2 / m 

Y Wl n -i,nCr{ W/ n _ lln _ 3lfl _i • • ■ <T I X>iiio,l*2o,0 

in-l \ (n-2 \ 'o 

(405) 

beginning with an input layer with m = d x + rf y nodes (plus possibly nodes 
to implement the bias) zi t o = z\ and going over intermediate layers with 
nodes zij, < i < n, 1 < I < rrii to a single node output layer £ n = 0(x, y). 

Commonly neural nets are used in regression and classification to param- 
eterize a function 0(x, y) = h(x) in functionals 

E = Y t {y i -h(x i ,w)) 2 , (406) 

i 

quadratic in h and without further regularization terms. In that case, regu- 
larization has to be assured by using either 1. a neural network architecture 
which is restrictive enough, 2. by using early stopping like training procedures 
so the full flexibility of the network structure cannot completely develop and 
destroy generalization, where in both cases the optimal architecture or al- 
gorithm can be determined for example by cross-validation or bootstrap 
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techniques |T^|, §, g2§ Q fjg, 

ensembles of networks 



|2"2"3l 



or 3. by averaging over 



167]. 



In all these cases regularization is implicit in 
the parameterization of the network. Alternatively, explicit regularization or 
prior terms can be added to the functional. For regression or classification 
this is for example done in learning by hints || ||, |J or curvature-driven 
smoothing with feedforward networks |22| . 

One may also remark that from a Frequentist point of view the quadratic 
functional is not interpreted as posterior but as squared-error loss J2i(yi — 
a(x{, w)) 2 for actions a(x) = a(x, w). According to Section ^.2.2| minimization 
of error functional ( |406| ) for data {(xj, < i < n} sampled under the true 
density p(x,y\f) yields therefore an empirical estimate for the regression 
function Jdyyp(y\x, f). 

We consider here neural nets as parameterizations for density estimation 
with prior (and normalization) terms explicitly included in the functional E^. 



In particular, the stationarity equation for functional ( |342| ) becomes 



= ^P'P^A - $' K</> - ^P'Ajc, 



(407) 



with matrix of derivatives 

d(j)(x,y,w) 



& w (k,l,i;x,y) 



dw 



ki, 



(408) 



ln—1 In — 2 

i+2k+l, i+2°~' (Vl i+ i,i+l)wi i+1 k,i+l&' ( v h,i) z l,i-l, 



h + l 

and cr'iy) = da{y)/dv. While (p(x,y,w) is calculated by forward propagating 
z = (x, y) through the net defined by weight vector w according to Eq. ( |405| ) 



the derivatives $' can efficiently be calculated by back-propagation according 
to Eq. ( |408| ) . Notice that even for diagonal P' the derivatives are not needed 
only at data points but the prior and normalization term require derivatives 
at all x, y. Thus, in practice terms like $'K0 have to be calculated in a 
relatively poor discretization. Notice, however, that regularization is here 
not only due to the prior term but follows also from the restrictions implicit 
in a chosen neural network architecture. In many practical relatively 
poor discretization of the prior term may thus be sufficient. 
Table summarizes the discussed approaches. 
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Ansatz 


Functional form 


to be optimized 


linear ansatz 




Cz 


linear model 
with interaction 


<t>{z) = Co + Ez 


Co, Cz 

Cmn; 


mixture model 


0(*) = ££o,i£i(6,*) 


Co,z, Cz 


additive model 
with interaction 


0(^) = Ei<f>i( z i) 




product ansatz 


<t>{z) = UiM z i) 




decision trees 


<f>(z) = Ei Cz rifc @(^ ife - Co,zfe) 


Cz? Co,zz« ; Czfc 


projection pursuit 


0(2) = Co + Ez <MCo,z + Ez Cz^z) 


0z, Co ; Co,z> Cz 


neural net (2 lay.) 


0(2) = o-(EzCz^(EfeCzfc^)) 


CZ) Czfc 



Table 6: Some possible parameterizations. 
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5 Parameterizing priors: Hyperparameters 



5.1 Prior normalization 

In Chapter [|. parameterization of have been studied. This section now 
discusses parameterizations of the prior density p(<p\Do). For Gaussian prior 
densities that means parameterization of mean and/or covariance. The pa- 
rameters of the prior functional, which we will denote by 9, are in a Bayesian 
context also known as hyperparameters. Hyperparameters 9 can be consid- 
ered as part of the hidden variables. 

In a full Bayesian approach the /z-integral therefore has to be completed 
by an integral over the additional hidden variables 9. Analogously, the prior 
densities can be supplemented by priors for 9, also be called hyperpriors, with 
corresponding energies Eg. 

In saddle point approximation thus an additional stationarity equation 
will appear, resulting from the derivative with respect to 9. The saddle point 
approximation of the ^-integration (in the case of uniform hyperprior p(9) 
and with the /i-integral being calculated exactly or by approximation) is also 
known as ML-II prior or evidence framework |86| , [208| , |146| , |147| , |148| , 



There are some cases where it is convenient to let the likelihood p(y\x, h) 
depend, besides on a function 0, on a few additional parameters. In regres- 
sion such a parameter can be the variance of the likelihood. Another example 
is the inverse temperature (3 introduced in Section which, like also ap- 
pears in the prior. Such parameters may formally be added to the "direct" 
hidden variables yielding an enlarged 0. As those "additional likelihood pa- 
rameters" are like other hyperparameters typically just real numbers, and not 
functions like 0, they can often be treated analogously to hyperparameters. 
For example, they may also be determined by cross-validation (see below) or 
by a low dimensional integration. In contrast to pure prior parameters, how- 
ever, the functional derivatives with respect to such "additional likelihood 
parameters" contain terms arising from the derivative of the likelihood. 

Within the Frequentist interpretation of error minimization as empirical 
risk minimization hyperparameters 9 can be determined by minimizing the 
empirical generalization error on a new set of test or validation data being 
independent from the training data D. Here the empirical generalization 
error is meant to be the pure data term Ed{9) = Ejj(<p*(9)) of the error 
functional for 0* being the optimal for the full regularized E<p(9) at 9 and 
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for given training data D. Elaborated techniques include cross-validation 
and bootstrap methods which have been mentioned in Sections |2.5| and [4.9| . 

Within the Bayesian interpretation of error minimization as posterior 
maximization the introduction of hyperparameters leads to a new difficulty 
The problem arises from the fact that it is usually desirable to interpret the 
error term Eg as prior energy for 9, meaning that 

p(9) = (409) 

with normalization 

Z e = Jd9e~ Ee , (410) 

represents the prior density for 9. Because the joint prior factor for and 9 
is given by the product 

p(<j>,9)=p(<f>\9)p(e), (411) 



one finds 



e -E(m 



P{0) = ~^~TffT- ( 412 ) 

Hence, the ^-dependent part of the energy represents a conditional prior 
energy denoted here E(<f>\9). As this conditional normalization 

Z+{6) = Jd<Pe~ E ^ e \ (413) 

is in general ^-dependent a normalization term 

E N (9) = ]nZ^9) (414) 

must therefore be included in the error functional when minimizing with 
respect to 9. 

It is interesting to look what happens if p((f>, 9) of Eq. ([409 ) is expressed 
in terms of joint energy E(<f>, 9) as follows 

-E(<t>,9) 

p{<f>,9) = — . (415) 

Then the joint normalization 

Zm = Jd<pd9e- Ei(t> ' 9 \ (416) 
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is independent of and 9 and could be skipped from the functional. However, 
in that case the term Eg cannot easily be related to the prior p(9). 

Notice especially, that this discussion also applies to the case where Eg 
is assumed to be uniform so it does not have to appear explicitly in the 
error functional. The two ways of expressing 9) by a joint or conditional 
energy, respectively, are equivalent if the joint density factorizes. In that 
case, however, 9 and are independent, so 9 cannot be used to parameterize 
the density of 4>. 

Numerically the need to calculate Z^(9) can be disastrous because nor- 
malization factors Z ^(9) represent often an extremely high dimensional (func- 
tional) integral and are, in contrast to the normalization of P over y, very 
difficult to calculate. 

There are, however, situations for which Z^ifi) remains ^-independent. 
Let p((p,9) stand for example for a Gaussian specific prior p((p, 9\D ) (with 
the normalization condition factored out as in Eq. (0)). Then, because the 
normalization of a Gaussian is independent of its mean, parameterizing the 
mean t = t{9) results in a ^-independent Z^{9). 

Besides their mean, Gaussian processes are characterized by their covari- 
ance operators K _1 . Because the normalization only depends on detK a 
second possibility yielding ^-dependent Z^iff) are parameterized transfor- 
mations of the form K — > OKO 1 with orthogonal O = 0(9). Indeed, 
such transformations do not change the determinant det K. They are only 
non-trivial for multi-dimensional Gaussians. 

For general parameterizations of density estimation problems, however, 
the normalization term In 2^(6?) must be included. The only way to get rid 
of that normalization term would be to assume a compensating hyperprior 



Ee,<f, = -(lnP(0), N) + {P (</>), A x ) + E^{9) + Eg + lnZ^(9). (418) 



writing E(<ft\9) = E^ and E(9) = Eg. The stationarity conditions have the 



p{9) oc z^e), 

resulting in an error term E{9) = — In Z,p(9) compensating E^{9). 
Thus, in the general case we have to consider the functional 



(417) 



form 




(419) 



dE* 



TlZ7\9)-E'g 



(420) 



09 
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with 

Z%k) = 5(l-k)^l, m = ^- (421) 

For compensating hyperprior Eg = — InZ^O) the right hand side of Eq. 
( |420| ) vanishes. 

Finally, we want to remark that in case function evaluation of p(<p, 9) 
is much cheaper than calculating the gradient ( |420| ), minimization methods 
not using the gradient should be considered, like for example the downhill 
simplex method [191]. 



5.2 Adapting prior means 
5.2.1 General considerations 

A prior mean or template function t represents a prototype, reference func- 
tion or base line for 0. It may be a typical expected pattern in time series 
prediction or a reference image in image reconstruction. Consider, for ex- 
ample, the task of completing an image <fi given some pixel values (training 
data) ||136fl . Expecting the image to be that of a face the template function t 
may be chosen to be some prototypical image of a face. We have seen in Sec- 
tion |3.5| that a single template t could be eliminated for Gaussian (specific) 
priors by solving for <p — t instead for <fi. Restricting, however, to only a single 
template may be a very bad choice. Indeed, faces for example appear on im- 
ages in many variations, like in different scales, translated, rotated, various 
illuminations, and other kinds of deformations. We may now describe such 
variations by a family of templates t(9), the parameter 9 describing scaling, 
translations, rotations, and more general deformations. Thus, we expect a 
function to be similar to only one of the templates t{9) and want to imple- 
ment a (soft, probabilistic) OR, approximating t{9\) OR £(#2) OR ■ • • (See 

also B BBIl). 

A (soft, probabilistic) AND of approximation conditions, on the other 
hand, is implemented by adding error terms. For example, classical error 
functionals where data and prior terms are added correspond to an approxi- 
mation of training data AND a priori data. 

Similar considerations apply for model selection. We could for example 
expect to be well approximated by a neural network or a decision tree. 
In that case t(9) spans, for example, a space of neural networks or decision 
trees. Finally, let us emphasize again that the great advantage and practi- 
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cal feasibility of adaptive templates for regression problems comes from the 
fact that no additional normalization terms have to be added to the error 
functional. 



5.2.2 Density estimation 

The general case with adaptive means for Gaussian prior factors and hyper- 
parameter energy Eg yields an error functional 

E e>rj> = -(In P(0), N) + X - (<f> - 1(9), K(4>- 1(6))) + (P(0), A x ) + Eg. (422) 
Defining 

t'((;x,,) = ^M^, (423) 

the stationarity equations of ( |422| ) obtained from the functional derivatives 
with respect to <ft and hyperparameters 9 become 

K(<j)-t) = P'O^p-^iV-P'O^Ax, (424) 
t'K(0-t) = -E' e . (425) 



Inserting Eq. fl424|) in Eq. (|425|) gives 

t'P'{<f>)p- 1 (<f>)N = t'P'i^Ax - E' e . (426) 



Eq.( f426j ) becomes equivalent to the parametric stationarity equation ( 346|) 
with vanishing prior term in the deterministic limit of vanishing prior co- 
variances K -1 , i.e., under the assumption = t(8), and for vanishing E' e . 



Furthermore, a non-vanishing prior term in ( |346| ) can be identified with the 



term Eg. This shows, that parametric methods can be considered as deter- 
ministic limits of (prior mean) hyperparameter approaches. In particular, a 
parametric solution can thus serve as reference template t, to be used within 
a specific prior factor. Similarly, such a parametric solution is a natural 
initial guess for a nonparametric (f) when solving a stationarity equation by 
iteration. 

If working with parameterized </>(£) extra prior terms Gaussian in some 



function can be included as discussed in Section [12|. Then, analogously 
to templates t for <ft, also parameter templates t^ can be made adaptive 
with hyperparameters 9^. Furthermore, prior terms Eg and Eg^ for the 
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hyperparameters 6, 6^ can be added. Including such additional error terms 
yields 



Ee^MMO = -(lnP(0(O),iV) + (P(0(O),Ax; 
1 

2 1 



+E + E ^, (427) 



+ 2 



and Eqs.p24"D and ( pE24] ) change to 



$'K(0-t)+#'K^-^) = P'^N-P'^Ax, (428) 
t'K(0-t) = -E' , (429) 
t;K^(^-^) = (430) 

where t^, E' e , £^ , denote derivatives with respect to the parameters ^ 
or respectively. Parameterizing ^ and E ^ the process of introducing 
hyperparameters can be iterated. 

5.2.3 Unrestricted variation 



To get a first understanding of the approach ( [42 2j ) let us consider the extreme 
example of completely unrestricted t-variations. In that case the template 
function t(x, y) itself represents the hyperparameter. (Such function hyper- 
parameters or hyperfields are also discussed in Sect. |5l|) Then, t' = I and 



Eq. ([425|) gives K(0 — t) = (which for invertible K is solved uniquely by 



t = 0), resulting according to Eq. ( |228| ) in 

Ax = N x . (431) 

The case of a completely free prior mean t is therefore equivalent to a situation 
without prior. Indeed, for invertible P', projection of Eq. ( [426| ) into the x- 
data space by Id of Eq. ( [257f ) yields 

P D = A X ] D N, (432) 

where Ax,d — IdAxId is invertible and Pjj = IpP- Thus for Xi for which 
Hi are available 
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is concentrated on the data points. Comparing this with solutions of Eq. 
( |191|) for fixed t we see that adaptive means tend to lower the influence of 
prior terms. 



5.2.4 Regression 

Consider now the case of regression according to functional (|246|) with an 



adaptive template to(0). The system of stationarity equations for the regres- 
sion function h(x) (corresponding to cj)(x,y)) and 9 becomes 

K {h-t ) = K D {t D -h), (434) 
t' K (h-t ) = 0. (435) 

It will also be useful to insert Eq. ( |434j) in Eq. (|435| ) , yielding 

= t' o K D {h-t D ). (436) 
For fixed t Eq. ( |434| ) is solved by the template average t 

h = t=(K + Kd)- 1 (K t + K D t D ) , (437) 
so that Eq. ( |435| ) or Eq. ( [43 6| ), respectively, become 

= t K (t-t ), (438) 
= t' o K D {t-t D ). (439) 
It is now interesting to note that if we replace in Eq. (|439|) the full template 



average t by t we get 

= t o K D (t o -t D ), (440) 
which is equivalent to the stationarity equation 

= H'K D (h-t D ), (441) 

(the derivative matrix H' being the analogue to $' for h) of an error functional 

Eo, m = \i HO - to, K D (h{£) - t D ) ) (442) 

without prior terms but with parameterized h(£), e.g., a neural network. The 
approximation h = t = t can, for example, be interpreted as limit A — > oo, 

lim h= lim t = t , (443) 

A^oo A^oo 
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after replacing K by AK in Eq. ( [437] ). The setting h = t can then be 
used as initial guess h° for an iterative solution for h. For existing Kg 1 h 
= to is also obtained after one iteration step of the iteration scheme h l = 
t Q + Kq 1 ~Ki)(tD — k 1 " 1 ) starting with initial guess h° = tjj. 

For comparison with Eqs. (}439||440| , [441| ) we give the stationarity equations 
for parameters £ for a parameterized regression functional including an ad- 
ditional prior term with hyperparameters 

Ee,m = liKO-tn, K D {h(0-t D ))+±{h{0-t (e), K o {6){h(O-t o {0))), 

(444) 

which are 

= H'K D (h-t D ) + tiK (h-t ). (445) 
Let us now compare the various regression functionals we have met up to 



now. The non-parameterized and regularized regression functional Eh ( |246| ) 
implements prior information explicitly by a regularization term. 

A parameterized and regularized functional -E'h(g) of the form ( |343| ) cor- 
responds to a functional of the form ( [444j ) for 6 fixed. It imposes restrictions 
on the regression function h in two ways, by choosing a specific parameteri- 
zation and by including an explicit prior term. If the number of data is large 
enough, compared to the flexibility of the parameterization, the data term 
of -EW) alone can have a unique minimum. Then, at least technically, no 
additional prior term would be required. This corresponds to the classical 
error minimization methods used typically for parametric approaches. Nev- 
ertheless, also in such situations the explicit prior term can be useful if it 
implements useful prior knowledge over h. 



The regularized functional with prior- or hyperparameters Eg^ (|422|) im- 



plements, compared to Eh, effectively weaker prior restrictions. The prior 
term corresponds to a soft restriction of h to the space spanned by the pa- 
rameterized t(8). In the limit where the parameterization of t(8) is rich 
enough to allow t{9*) = h* at the stationary point the prior term vanishes 
completely. 

The parameterized and regularized functional Eq^) ( |444|) , including 
prior parameters 9, implements prior information explicitly by a regular- 
ization term and implicitly by the parameterization of h(£). The explicit 
prior term vanishes if t(8*) = h(£*) at the stationary point. The func- 
tional combines a hard restriction of h with respect to the space spanned 
by the parameterization h(£) and a soft restriction of h with respect to the 
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space spanned by the parameterized t{6). Finally, the parameterized and 
non-regularized functional EdmO ( HI ) implements prior information only 
implicitly by parameterizing h(£). In contrast to the functionals Eq^ and 
Ee,h(€> ^ implements only a hard restriction for h. The following table sum- 
marizes the discussion: 



Functional 


Eq. 


prior implemented 


E h 

E Kt) 
Ee,h 

E D Mi) 


(|246) 
0343) 
(F22P 

([444) 


explicitly 

explicitly and implicitly 
explicitly 

no prior for t{6*) = h* 

explicitly and implicitly 

no expl. prior for t(6*) = h(^*) 

implicitly 



5.3 Adapting prior covariances 
5.3.1 General case 

Parameterizing covariances K -1 is often desirable in practice. It includes 
for example adapting the trade-off between data and prior terms (i.e., the 
determination of the regularization factor), the selection between different 
symmetries, smoothness measures, or in the multidimensional situation the 
determination of directions with low variance. As far as the normalization 
depends on K(#) one has to consider the error functional 

E e ^ = -(In P(0), N) + X - (cj> - 1, K(0) (<j) - 1)) + (P ((/>), A x ) + In Z^O) + E e , 

(446) 

with 

^(0) = (27r)5(detK(0))-t, (447) 
for a rf-dimensional Gaussian specific prior, and stationarity equations 

K{<f>-t) = P'{<f>)P- 1 {<f>)N-P , (<f>)Ax, (448) 
l^-^ «-*)) ^(K-V)^f)-^, (449) 
Here we used 

Br) ( <9K\ 

-IndetK = -Tr InK = Tr K- 1 — . (450) 
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In case of an unrestricted variation of the matrix elements of K the hyper- 
parameters become 9i = 9(x,y;x',y') = K.(x,y;x',y'). Then, using 



° KU ' ' ' l/] - S(x - x")5{y - y")S(x' - x"')S(y' - y'"), (451) 



d6(x",y";x"',y"') 
Eqs. (|449|) becomes the inhomogeneous equation 

1(0 - t) (0 - t) T = Tr (k- V)^) " K (452) 

We will in the sequel consider the two special cases where the determinant 
of the covariance is ^-independent so that the trace term vanishes, and where 
9 is just a multiplicative factor for the specific prior energy, i.e., a so called 
regularization parameter. 

5.3.2 Automatic relevance detection 

A useful application of hyperparameters is the identification of sensible di- 
rections within the space of x and y variables. Consider the general case 
of a covariance, decomposed into components K = J2i ^Kj. Treating the 
coefficient vector 9 (with components 9i) as hyperparameter with hyperprior 
p(9) results in a prior energy (error) functional 

i(0-t, (-J20i^i)(<f>-t))-\np(9)+lnZ^e). (453) 

i 

The ^-dependent normalization InZ^) has to be included to obtain the 
correct stationarity condition for 9. The components Kj can be the compo- 
nents of a negative Laplacian, for example, Kj = —d^. or Kj = —dy.. In that 
case adapting the hyperparameters means searching for sensible directions in 
the space of x or y variables. This technique has been called Automatic Rel- 
evance Determination by MacKay and Neal [ 167|| . The positivity constraint 



for a can be implemented explicitly, for example by using K = J2i or 
Ko = Eiexp(0i)Ki. 

5.3.3 Local smoothness adaption 

Similarly, the regularization factor of a smoothness related covariance op- 
erator may be adapted locally. Consider, for example, a prior energy for 
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<f>{x,y) 

E(m = l(4>~t, K(a, b)(<f>-t)), (454) 
with a Laplacian prior 

m x rriy 

K(x, y, y'; 9) = - £ e^'M 5{ Xl - x[) d% - £ e*.'M tf(y - $ flj,, (455) 

i i 

for m x -dimensional vector x and m y -dimensional vector y depending on func- 
tions 9 x ,i(x) and O y ,i(y) (or more general 9 x j(x,y) and 9 y ^(x,y)) collectively 
denoted by 9. Expressing the coefficient functions as exponentials exp(0 X) j), 
exp(9 V) i) is one possibility to enforce their positivity. Typically, one might 
impose a smoothness hyperprior on the functions 9 x ^(x) and O y ,i(y), for ex- 
ample by using an energy functional 

i mi 1 m v 

E(fr9) + ~Y^(9 Xji , K 9:X 9 x ,i) + -J2(9 y ,i, KeJy^+lnZ^O), (456) 

i i 

with smoothness related ~Ke,x, Kg i2/ . The stationarity equation for a functions 
9 x ,i(x) reads 



= (K e ^ i )(x)-(0(a;,y)-t(a;,y))(^(0(x, 2 /)-t(x,i/)) 



+d 6x . {x) hxZ <t> {9). (457) 
The functions 9 xi (x) and 9 y ^(y) are examples of function hyperparameters 



(see Sect. |5.6|) . 



5.3.4 Local masses and gauge theories 

The Bayesian analog of a mass term in quantum field theory is a term propor- 
tional to the identity matrix I in the inverse prior covariance K . Consider, 
for example, 

K = # 2 I-A, (458) 

with 9 real (so that 9 2 > 0) representing a mass parameter. For large masses 
(f) tends to copy the template t locally, and longer range effects of data points 
following from smoothness requirements become less important. Similarly 
to Sect. |5.3.3| a constant mass can be replaced by a mass function 9(x). 



This allows to adapt locally that interplay between "template copying" and 
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smoothness related influence of training data. As hyperprior, one may use a 
smoothness constraint on the mass function 9(x), e.g., 

l^-t, M 2 (0 - t)) - 1(0 - t, A(0 - t)) + X (9, K e 9) + In Z^9\ (459) 

where M denotes the diagonal mass operator with diagonal elements 9{x). 

Functional hyperparameters like 9[x) represent, in the language of physi- 
cists, additional fields entering the problem (see also Sect. |5T6|) . There are 
similarities for example to gauge fields in physics. In particular, a gauge 
theory-like formalism can be constructed by decomposing 9(x) = J2i9i(x), 
so that the inverse covariance 

K = £ (M, 2 - a?) = E (M< + d t ) (M, - ft) = £ D\D U (460) 

i i i 

can be expressed in terms of a "covariant derivative" Di = ft + ft. Next, one 
may choose as hyperprior for 9i(x) 

i / Blj m, m, \ -r m x 

- £ (ft, -A ft)-(E E ftyfi) = 7 E 4 ( 461 ) 

\ i ' j / 'J 

which can be expressed in terms of a "field strength tensor" (for Abelian 
fields) , 

F .. = q.q. _ d .g it (4 6 2) 

like, for example, the Maxwell tensor in quantum electrodynamics. (To relate 
this, as in electrodynamics, to a local U(l) gauge symmetry <fi — > e ia 4> one can 
consider complex functions <f), with the restriction that their phase cannot 
be measured.) Notice, that, due to the interpretation of the prior as product 
p{4>\9)p(9) , an additional ^-dependent normalization term In Z^iff) enters the 
energy functional. Such a term is not present in quantum field theory, where 
one relates the prior functional directly to p(<ft, 9), so the norm is independent 
of and 9. 



5.3.5 Invariant determinants 

In this section we discuss parameterizations of the covariance of a Gaussian 
specific prior which leave the determinant invariant. In that case no 9— 
dependent normalization factors have to be included which are usually very 
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difficult to calculate. We have to keep in mind, however, that in general a 
large freedom for K(0) effectively diminishes the influence of the parameter- 
ized prior term. 

A determinant is, for example, invariant under general similarity trans- 
formations, i.e., detK = det K for K — > K = OKO 1 where O could be 
any element of the general linear group. Similarity transformations do not 
change the eigenvalues, because from K-0 = \ip follows OKO _1 0^ = XOip. 
Thus, if K is positive definite also K is. The additional constraint that K 
has to be real symmetric, 

K = K T = K f , (463) 
requires O to be real and orthogonal 

CT 1 = O t = O f . (464) 

Furthermore, as an overall factor of O does not change K one can restrict O 
to a special orthogonal group SO(N) with det O = 1. If K has degenerate 
eigenvalues there exist orthogonal transformations with K = K. 

While in one dimension only the identity remains as transformation, the 
condition of an invariant determinant becomes less restrictive in higher di- 
mensions. Thus, especially for large dimension d of K (infinite for continuous 
x) there is a great freedom to adapt covariances without the need to calcu- 
late normalization factors, for example to adapt the sensible directions of a 
multivariate Gaussian. 

A positive definite K can be diagonalized by an orthogonal matrix O 
with det O = 1, i.e., K = ODO T . Parameterizing O the specific prior term 
becomes 

\{<l>-t, K(6) (0 - 1)) = - t, O(0)VO T (6) (</> - f)), (465) 

so the stationarity Eq. (|449|) reads 

<j>-t, ^DO T (0 - t)) = -i4 (466) 

Matrices O from SO(N) include rotations and inversion. For a Gaussian 
specific prior with nondegenerate eigenvalues Eq. ( |466| ) allows therefore to 
adapt the 'sensible' directions of the Gaussian. 

There are also transformations which can change eigenvalues, but leave 
eigenvectors invariant. As example, consider a diagonal matrix D with di- 
agonal elements (and eigenvalues) Aj ^ 0, i.e., det D = JJiXi- Clearly, any 
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permutation of the eigenvalues Aj leaves the determinant invariant and trans- 
forms a positive definite matrix into a positive definite matrix. Furthermore, 
one may introduce continuous parameters 9ij > with i < j and transform 
D — > D according to 

A, -> A, = Ai%, Xj -> Xj = (467) 

Vij 

which leaves the product XiXj = XiXj and therefore also the determinant 
invariant and transforms a positive definite matrix into a positive definite 
matrix. This can be done with every pair of eigenvalues defining a set of 
continuous parameters % with i < j (0y can be completed to a symmetric 
matrix) leading to 

A ? . - A, = A,P^, (468) 
which also leaves the determinant invariant 

detD = nA* = n( A *^rl = (li X i) =^ A * = detD ^ 

^ \ II, ■<>:,■) \ i J Ihllj i'ly, i 

(469) 

A more general transformation with unique parameterization by &i > 0, 
i 7^ i*, still leaving the eigenvectors unchanged, would be 

Xi = Xi9 h % ± z*; Xi* = X,^ n 0^. (470) 

This techniques can be applied to a general positive definite K after diago- 
nalizing 

K = ODO T -> K = ODO T => det K = det K. (471) 



As example consider the transformations (|468 , 470 ) for which the specific 
prior term becomes 

l -{4> - 1, K(9) (0 - t)) = \{4> - t, OV(9)O t (<j> - f)), (472) 
and stationarity Eq. ( |449| ) 

i(0-t,O^O r (0-t)) = -^, (473) 
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and for (|468|) , with k < I, 



= 5{t-j)[S{k- i) A fe — f — - — + 5{l-i) A/— — , (474) 

OVkl \ lln<k U nk llk^n<lVnlJ 

or, for (TO , with k^i*, 

If, for example, K is a translationally invariant operator it is diagonalized 
in a basis of plane waves. Then also K is translationally invariant, but 
its sensitivity to certain frequencies has changed. The optimal sensitivity 
pattern is determined by the given stationarity equations. 

5.3.6 Regularization parameters 

Next we consider the example K(7) = 7K0 where 9 > has been denoted 7, 
representing a regularization parameter or an inverse temperature variable for 
the specific prior. For a <i-dimensional Gaussian integral the normalization 
factor becomes Z^j) = ( — ) 2 (det K ) -1 / 2 . For positive (semi) definite K the 

dimension d is given by the rank of K under a chosen discretization. Skipping 

d 
2 

dK 

we obtain the stationarity equations 



constants results in a normalization energy E N {i) = -fin 7. With 

K (476) 



7 K o (0-t) = P'(0)P (0)^-P'(0)Ax, (477) 



~(0-*,K o (0-O) = / 

Z Z^j 



t, K o (0-t)) = —-E' r (478) 



For compensating hyperprior the right hand side of Eq. (|478|) vanishes, giving 



thus no stationary point for 7. Using however the condition 7 > one sees 



that for positive definite Kq Eq. (477) is minimized for 7 = corresponding 



to the 'prior-free' case. For example, in the case of Gaussian regression the 
solution would be the data template <fi — h — to- This is also known as 
"<5-catastrophe" . To get a nontrivial solution for 7 a noncompensating hy- 
perparameter energy E 1 = Eg must be used so that In + E N is nonuniform 

© HI- 
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Figure 6: Shown is the joint posterior density of h and 7, i.e., p(h, j\D, D ) oc 
p{yD\h)p{h\^f, D )p( / ~f) for a zero-dimensional example of Gaussian regression 
with training data yjj = and prior data yu = 1- L.h.s: For uniform prior 



^(7) oc 1 so that the joint posterior becomes p oc e 



L fe 2_ i(fe _ 1)2+ i ln7 



having 



its maximum is at 7 = 00, h — 1. R.h.s.: For compensating hyperprior 



^(7) oc 1/^/7 so that p oc e 
/i = 0. 



2™ 



-i(ft-l) 2 



having its maximum is at 7 = 0, 



The other limiting case is a vanishing E 1 for which Eq. 

d 

7 = 7 



t, K (0 - t)) 



) becomes 
(479) 



For (j) — > t one sees that 7 — > 00. Moreover, in case P[£] represents a nor- 
malized probability, <fi = t is also a solution of the first stationarity equation 
( 4771) i n the limit 7 — > 00. Thus, for vanishing P^ the 'data-free' solution 
= t is a selfconsistent solution of the stationarity equations ( f477| , [478| ). 

Fig.||| shows a posterior surface for uniform and for compensating hyper- 
prior for a one-dimensional regression example. The Maximum A Posteriori 
Approximation corresponds to the highest point of the joint posterior over 
7, h in that figures. Alternatively one can treat the 7-integral by Monte- 
Carlo-methods ||23 1| . 



Finally we remark that in the setting of empirical risk minimization, 
due to the different interpretation of the error functional, regularization pa- 
rameters are usually determined by cross-validation or similar techniques 
I63j §, [22|, Q [212], [39], |22|, |4|, §3 . 
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5.4 Exact posterior for hyperparameters 

In the previous sections we have studied saddle point approximations which 
lead us to maximize the joint posterior p(h,9\D, Dq) simultaneously with 
respect to the hidden variables h and 9 

p(y\x,D,D ) = p(y D \x D ,D )- 1 Jdh fd8p(y\x, h) p(y D \x D , h)p(h\D , fl)p(fl) , 

ocp(h,9\D,Do), max w.r.t. 9 and h 

(480) 

assuming for the maximization with respect to h a slowly varying p(y\x, h) 
at the stationary point. 

This simultaneous maximization with respect to both variables is consis- 
tent with the usual asymptotic justification of a saddle point approximation. 
For example, for a function f(h,6) of two (for example, one-dimensional) 
variables h, 9 

dhd9e-^ h ^ « e -/V(W-£mdetC0H/2*) (4gl) 

for large enough (3 (and a unique maximum). Here f(h*,9*) denotes the joint 
minimum and H the Hessian of / with respect to h and 9. For ^-dependent 
determinant of the covariance and the usual definition of (3, results in a 
function / of the form f(h,6) = E(h,9) + (1/2/3) In det(/3K(0)/27r), where 
both terms are relevant for the minimization of / with respect to 9. For 
large /3, however, the second term becomes small compared to the first one. 
(Of course, there is the possibility that a saddle point approximation is not 
adequate for the 9 integration. Also, we have seen that the condition of a 
positive definite covariance may lead to a solution for 9 on the boundary 
where the (unrestricted) stationarity equation is not fulfilled.) 

Alternatively, one might think of performing the two integrals stepwise. 
This seems especially useful if one integral can be calculated analytically. 
Consider, for example 

dhdBe-MW> « fd9e-^ h '^-^ det ^^^^ 



(482) 

which would be exact for a Gaussian ^.-integral. One sees now that mini- 
mizing the complete negative exponent j3f(9,h*) + ^\ndet(/3(d 2 f/dh 2 )/27i) 



with respect to 9 is different from minimizing only / in ( [48 lj ), if the second 
derivative of / with respect to h depends on 9 (which is not the case for 
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a Gaussian 9 integral). Again this additional term becomes negligible for 
large enough (3. Thus, at least asymptotically, this term may be altered or 
even be skipped, and differences in the results of the variants of saddle point 
approximation will be expected to be small. 

Stepwise approaches like (|482|) can be used, for example to perform Gaus- 
sian integrations analytically, and lead to somewhat simpler stationarity 



equations for ^-dependent covariances ||231| . 

In particular, let us look at the case of Gaussian regression in a bit more 
detail. The following discussion, however, also applies to density estimation 



if, as in ( |482|) , the Gaussian first step integration is replaced by a saddle point 
approximation including the normalization factor. (This requires the calcu- 
lation of the determinant of the Hessian.) Consider the two step procedure 
for Gaussian regression 

p{y\x, D, D ) = p{y D \x D , Do)' 1 d9p(9) dhp(y\x, h)p(y D \x D , h)p(h\D , 9), 



exact 



p(6)p(y,y]j\x,X£),DQ,6)ccp(y,9\x,D,DQ) maxw.r.t. 9 



J d9 p(6\D, D ) p(y\x,D D ,9) (483) 



ocexact 



p(y,9\x,D,Do), maxw.r.t. 8 

where in a first step p(y, yo\x, Xd, D , 9) can be calculated analytically and in 
a second step the 9 integral is performed by Gaussian approximation around 
a stationary point. Instead of maximizing the joint posterior p(h,9\D, Dq) 
with respect to h and 9 this approach performs the /i-integration analytically 
and maximizes p(y,9\x, D, D ) with respect to 9. The disadvantage of this 
approach is the y-, and x-dependency of the resulting solution. 

Thus, assuming a slowly varying p(y\x, D, D , 9) at the stationary point 
it appears simpler to maximize the /i-marginalized posterior, p{9\D, Dq) = 
J dhp(h,9\D, Dq), if the /i-integration can be performed exactly, 

p(y\x,D,D )= fd9p(9\D,DQ) p(y\x,D Dq,9\ . (484) 



Having found a maximum posterior solution 9* the corresponding analytical 
solution for p(y\x, D, D , 9*) is then given by Eq. ( flllj) . The posterior density 
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p(0\D, D ) can be obtained from the likelihood of 9 and a specified prior p{9) 

p(e\D,D a) = ^f°; D °^f\ (485) 
p{y D \x D ,D ) 

Thus, in case the ^-likelihood can be calculated analytically, the ^-integral is 
calculated in saddle point approximation by maximizing the posterior for 9 
with respect to 9. In the case of a uniform p(9) the optimal 9* is obtained by 
maximizing the ^-likelihood. This is also known as empirical Bayes approach 
p5| . As h is integrated out in p(y D \xn, D ,9) the ^-likelihood is also called 
marginalized likelihood. 

Indeed, for Gaussian regression, the ^-likelihood can be integrated ana- 
lytically, analogously to Section |3.7.2| , yielding ||223| , [232j , [231|| , 



p(y D \x D ,D ,9) = Jdhp(y D \x D ,h)p(h\D ,9) 

= Jdh e~5 ELo (h-ti,K4(h-ti))+% J2to ladet ^/ 2 ^ 

= e -| Er=0 (*i,Kiti)+|(*,Kt)+| lndct D (K/27r) 

-5 (tl5-*0, K(to-to)) +| lndct D K-f ln(27r) 
= g-S+ilndotoK-I^Tr)^ ( 4g6 ) 

where £ = ±(t D - *„, K(t D - t )), K = (K5 1 + K^fl))" 1 = K D + 
KdK _1 Kd, detp the determinant in data space, and we used that from 
K- X K, = 6 tj for z, j > follows E? =0 (U, K&) = (t D , K D t D ) + (t Q , K t ) 
= (to, Kt), with K = EiLo-K*- In cases where the marginalization over h, 
necessary to obtain the evidence, cannot be performed analytically and all 
/i-integrals are calculated in saddle point approximation, we get the same re- 
sult as for a direct simultaneous MAP for h and 9 for the predictive density 
as indicated in ( |480| ). 

Now we are able to compare the three resulting stationary equations 
for ^-dependent mean t (9), covariance K (6 I ) and prior p{9). Setting the 
derivative of the joint posterior p(h, 9\D, D ) with respect to 9 to zero yields 

o = (§* Kofo-k)) +5(fc-«H^(fc-t 
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This equation which we have already discussed has to be solved simulta- 
neously with the stationarity equation for h. While this approach is easily 
adapted to general density estimation problems, its difficulty for ^-dependent 
covariance determinants lies in calculation of the derivative of the determi- 
nant of Ko- Maximizing the /i-marginalized posterior p(6\D, Dq), on the 
other hand, only requires the calculation of the derivative of the determinant 
of the n x n matrix K 

= (?|, K(fo-f D )) +i((*D-*o), ^(«d-«o)J 




- Tl { K w)-W>w- (488) 

Evaluated at the stationary h* = t + Kq 1 K(t£ ) — t ), the first term of Eq. 



(|487|) , which does not contain derivatives of the covariances, becomes equal to 



the first term of Eq. ( |488|) . The last terms of Eqs. (|487|) and ( |488| ) are always 



identical. Typically, the data-independent K has a more regular structure 
than the data-dependent K. Thus, at least for one or two dimensional x, a 
straightforward numerical solution of Eq. ( }487|) by discretizing x can also be 
a good choice for Gaussian regression problems. 

Analogously, from Eq. ( |311|) follows for maximizing p(y, 9\x, D, D ) with 
respect to 6 



._ x dKy \ _ 1 <)/,(() /)./■ 
36 ) p(6\D,D ) 06 



-Tr K" ^ ) - „S an > (489) 



which is y—, and x-dependent. Such an approach may be considered if inter- 
ested only in specific test data x, y. 

We may remark that also in Gaussian regression the ^-integral may be 
quite different from a Gaussian integral, so a saddle point approximation 
does not necessarily have to give satisfactory results. In cases one encoun- 
ters problems one can, for example, try variable transformations J f(6)d6 = 
J det(<96> / '06') f '{6{6'))d6' to obtain a more Gaussian shape of the integrand. 
Due to the presence of the Jacobian determinant, however, the asymptotic 
interpretation of the corresponding saddle point approximation is different 
for the two integrals. The variability of saddle point approximations results 
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from the freedom to add terms which vanish asymptotically but remains fi- 
nite in the nonasymptotic region. Similar effects are known in quantum many 
body theory (see for example ||169|| , chapter 7.) Alternatively, the ^-integral 
can be solved numerically by Monte Carlo methods [ |232| , |231|| . 



5.5 Integer hyperparameters 

The hyperparameters 9 considered up to now have been real numbers, or 
vector of real numbers. Such hyperparameters can describe continuous trans- 
formations, like the translation, rotation or scaling of template functions and 
the scaling of covariance operators. For real 9 and differentiate posterior, 
stationarity conditions can be found by differentiating the posterior with 
respect to 9. 

Instead of a class of continuous transformations a finite number of al- 
ternative template functions or covariances may be given. For example, an 
image to be reconstructed might be expected to show a digit between zero 
and nine, a letter from some alphabet, or the face of someone who is a mem- 
ber of known group of people. Similarly, a particular times series may be 
expected to be either in a high or in a low variance regime. In all these cases, 
there exist a finite number of classes i which could be represented by specific 
templates or covariances Kj. Such "class" variables i are nothing else than 
hyperparameters 9 with integer values. 

Binary parameters, for example, allow to select from two reference func- 
tions or two covariances that one which fits the data best. E.g., for i = 
9 G {0, 1} one can write 

t{9) = (l-9)tx + 9h, (490) 
K(0) = (1-0)K! + 0K 2 . (491) 

For integer 9 the integral / d9 becomes a sum J2$ ( we will also use the 
letter % and write Yl,i for integer hyperparameters), so that prior, posterior, 
and predictive density have the form of a finite mixture with components 9. 

For a moderate number of components one may be able to include all 
of the mixture components. Such prior mixture models will be studied in 
Section || 

If the number of mixture components is too large to include them all 
explicitly, one again must restrict to some of them. One possibility is to 
select a random sample using Monte-Carlo methods. Alternatively, one may 
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search for the 9* with maximal posterior. In contrast to typical optimization 
problems for real variables, the corresponding integer optimization problems 
are usually not very smooth with respect to 9 (with smoothness defined in 
terms of differences instead of derivatives), and are therefore often much 
harder to solve. 

There exists, however, a variety of deterministic and stochastic integer op- 
timization algorithms, which may be combined with ensemble methods like 
genetic algorithms [98|, [79|, [44], |154| , L21| , f204j , |157|| , and with homotopy meth- 



ods, like simulated annealing |T|, [L53|, |S| [|| |, |9|, |3| || |3|, |4| . An 



nealing methods are similar to (Markov chain) Monte-Carlo methods, which 
aim in sampling many points from a specific distribution (i.e., for example 
at fixed temperature). For them it is important to have (nearly) indepen- 
dent samples and the correct limiting distribution of the Markov chain. For 
annealing methods the aim is to find the correct minimum (i.e., the ground 
state having zero temperature) by smoothly changing the temperature from 
a finite value to zero. For them it is less important to model the distribution 
for nonzero temperatures exactly, but it is important to use an adequate 
cooling scheme for lowering the temperature. 

Instead of an integer optimization problem one may also try to solve a 
similar problem for real 9. For example, the binary 9 G {0, 1} in Eqs. (|490|) 



and (|491|) may be extended to real 9 G [0, 1]. By smoothly increasing an 



appropriate additional hyperprior p(9) one can finally enforce again binary 
hyperparameters 9 G {0, 1}. 

5.6 Local hyperfields 

Most, but not all hyperparameters 9 considered so far have been real or 
integer numbers, or vectors with real or integer components 9^. With the 



unrestricted template functions of Sect. |5.2.3| or the functions parameterizing 



the covariance in Sections |5.3.3| and [5.3.4| , we have, however, also already 



encountered function hyperparameters or hyperfields. In this section we will 
now discuss function hyperparameters in more detail. 

Functions can be seen as continuous vectors, the function values 9(u) be- 
ing the (continuous) analogue of vector components 9{. In numerical calcula- 
tions, in particular, functions usually have to be discretized, so, numerically, 
functions stand for high dimensional vectors. 

Typical arguments of function hyperparameters are the independent vari- 
ables x and, for general density estimation, also the dependent variables y. 
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Such functions 6{x) or 9(x, y) will be called local hyperparameters or local hy- 
perfields. Local hyperfields 0{x) can be used, for example, to adapt templates 
or covariances locally. (For general density estimation problems replace here 
and in the following x by (x,y).) 

The price to be paid for the additional flexibility of function hyperparam- 
eters is a large number of additional degrees of freedom. This can consid- 
erably complicate calculations and, requires a sufficient number of training 
data and/or a sufficiently restrictive hyperprior to be able to determine the 
hyperfield and not to make the prior useless. 

To introduce local hyperparameters 6{x) we express real symmetric, pos- 
itive (semi-) definite inverse covariances by square roots or "filter operators" 
W, K = W T W — Jdx W x Wj where W x represents the vector W(x, •). 
Thus, in components 

K(x, x') = J dx" W T {x, x")W(x", x'), (492) 

and therefore 

(0 - t , K(0 - *)) = Jdx dx' [(f)(x) - t(x)} K T {x, x') [<f){x') - t(x')} 

= J dx dx' dx" [<p(x) — t(x)] W T (x, x) 

x W(x',x")[(f)(x")-t(x")] 
= jdx\uj(x)\ 2 , (493) 

where we defined the "filtered differences" 

u{x) = (W x ,<l>-t) = Jdx'W(x,x')[(f)(x') -t(x')}. (494) 

Thus, for a Gaussian prior for we have 

p{4>) oc e-^-^m-t)) = e -U^H-)\\ (495) 

A real local hyperfield 9(x) mixing, for instance, locally two alternative fil- 
tered differences may now be introduced as follows 

p(0|0) = e -5/«kM*;0)| 2 -lnZ*(tf) = e -±fdx\[l-0(x)]ui(xW(x)u2{x)\ 2 -lnZ+(0)^ ^gg) 

where 

u(x; 6) = [1- 8(x)\ Ui{x) + 8(x) u 2 (x), (497) 
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and, say, 9(x) G [0,1]. For unrestricted real 9(x) an arbitrary real u(x;9) 
can be obtained. For a binary local hyperfield with 9(x) G {0, 1} we have 9 2 
= 9,(1- 9 f = (1 - 0), and 9(1 - 9) = 0, so Eq. flMg) becomes 

(498) 

For real 9(x) in Eq. ( gg7| ) terms with 9 2 (x), [1 - ^(s)] 2 , and [1 - 9(x)]9(x) 
would appear in Eq. (|498 ). A binary 9 variable can be obtained from a real 
9 by replacing 

B e (x) = B(9(x) 9(x). (499) 

Clearly, if both prior and hyperprior are formulated in terms of such Bq(x) 
this is equivalent to using directly a binary hyperfield. 

For a local hyperfield 9(x) a local adaption of the functions u(x; 9) as in 
Eq. ( 497 ) can be achieved by switching locally between alternative templates 
or alternative filter operators W 

t x (x'-9) = [l-9(x)]t hx (x') + 9(x)t 2:X (x'), (500) 
W(x,x';9) = [l-9(x)]W 1 (x,x') + 9(x)W 2 (x,x'). (501) 



In Eq. ( |500[ ) it is important to notice that "local" templates t x (x'; 9) for fixed 
x are still functions of an x' variable. Indeed, to obtain u(x\ 9), the function 
t x is needed for all x' for which W has nonzero entries. 

uj(x;9) = Jdx'W(x,x')[(t)(x) -t x (x';9)}. (502) 

That means that the template is adapted individually for every local filtered 
difference. Thus, Eq. ( j500|) has to be distinguished from the choice 

t(x'; 9) = [1 - 9(x')} h(x') + 9(x') t 2 (x'). (503) 

The unrestricted adaption of templates discussed in Sect. |5.2.3| , for instance, 
can be seen as an approach of the form of Eq. ( |503| ) with an unbounded real 
hyperfield 9 (x). 

Eq. ( |501| ) corresponds for binary 9 to an inverse covariance 

K(0) = Jdx K x (9) = Jdx ([1 - 9(x)]W l . x Wl x + 9(x)W 2>x W? x ) , (504) 

where K x (9) = W x (9)W i r (9) and W hX = W<(x, •), W x (9) = W(x, ■ ; 9). We 
remark that ^-dependent covariances require to include the normalization 
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factors when integrating over 9 or solving for the optimal 9 in MAP. If we 
consider two binary hyperfields 9, 9', one for t and one for W, we find a prior 

p{0,9') oc e - E ^ 8 ^ = e -^(^M*).K,(0O[<M*(< (505) 

Up to a (^-independent constant (which still depends on 9, 9') the corre- 
sponding prior energy can again be written in the form 

E{4>\0, 9') =\(4>- m 0') , K(0') [0 - t(9, 9')]) . (506) 

Indeed, the corresponding effective template t(9, 9') and effective covariance 
K(0') are according to Eqs. ( |246| , |249j ) given by 

t{9,9') = K(0') _1 [dxK x (9')t x (6), (507) 

K(0') = JdxK x (9'). (508) 

Hence, one may rewrite 

Jdx\u{x;9,9')\ 2 = (<j)-t{9,9'), K{9')[<f)-t{9,9')]) (509) 

+ K x (9')t x (9)) - (t(9,9'), Kt(9,9')). 

X 

The MAP solution of Gaussian regression for a prior corresponding to 



at optimal 9*, 9'* is according to Section [T7| therefore given by 

(j)*(9*, 9'*) = [K D + K^'*)]- 1 (Kd*d + K(0 t{9*, 9'*)) . (510) 

One may avoid dealing with "local" templates t x {6) by adapting templates 
in prior terms where K is equal to the identity I. In that case (t Q ) x (x'; 9) is 
only needed for x = x' and we may thus directly write (t ) x (x f ; 9) = t (x'; 9). 
As example, consider the following prior energy, where the ^-dependent tem- 
plate is located in a term with K = I and another, say smoothness, prior is 
added with zero template 

E(<f>\9) = l(<t>-to(0), (0-t o (0))) + ~(>, K o 0). (511) 
Combining both terms yields 

£(#) = i((0-f(0), K[0-t(0)]) + (t o (0), (i-K- 1 )^))), (512) 
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with effective template and effective inverse covariance 

t(6)=K-H {6), K = I + K . (513) 

For differential operators W the effective t(9) is thus a smoothed version of 
t„(0). 

The extreme case would be to treat t and W itself as unrestricted hyper- 
parameters. Notice, however, that increasing flexibility tends to lower the 
influence of the corresponding prior term. That means, using completely free 
templates and covariances without introducing additional restricting hyper- 
priors, just eliminates the corresponding prior term (see Section |5.2.3| ). 

Hence, to restrict the flexibility, typically a smoothness hyperprior may 
be imposed to prevent highly oscillating functions 9(x). For real 9(x), for 
example, a smoothness prior like (9, —A9) can be used in regions where it 
is defined. (The space of ^-functions for which a smoothness prior (0 — 
t, K(0 — £)) with discontinuous t(9) is defined depends on the locations of 
the discontinuities.) An example of a non-Gaussian hyperprior is, written 
here for a one-dimensional x, 

p(9) oce-%f dxCeix) , (514) 

where k is some constant and 



C (x) = G — - # e . (515) 




is zero at locations where the square of the first derivative is smaller than 
a certain threshold < $g < oo, and one otherwise. (The step function 
is defined as Q(x) = for x < and Q(x) = 1 for 1 < x < oo.) To 
enable differentiation the step function G could be replaced by a sigmoidal 
function. For discrete x one can count analogously the number of jumps 
larger than a given threshold. Similarly, one may penalize the number N^(9) 

of discontinuities where {j^j = oo and use 



p(9) oc e"3 Nd(9) . (516) 

In the case of a binary field this corresponds to counting the number of times 
the field changes its value. 
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The expression Cq of Eq. (|515|) can be generalized to 

C (x)=Q(\u O (x)\ 2 -#<,), 

where, analogously to Eq. ( [494| ) , 

loq{x) = I dx We(x, x')[9{x) — tg(x , )\, 



(517) 



(518) 



and We is some filter operator acting on the hyperfield and tg(x') is a tem- 
plate for the hyperfield. 

Discontinuous functions can either be approximated by using discon- 
tinuous templates t(x; 9) or by eliminating matrix elements of the inverse 
covariance which connect the two sides of the discontinuity. For example, 
consider the discrete version of a negative Laplacian with periodic boundary 
conditions, 
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and square root, 
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The first three points can be disconnected from the last three points by 
setting W(3) and W(6) to zero, namely, 
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so that the smoothness prior is ineffective between points from different re- 
gions, 



K = W'W 
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In contrast to using discontinuous templates, the height of the jump at the 
discontinuity has not to be given in advance when using such disconnected 
Laplacians (or other inverse covariances) . On the other hand training data 
are then required for all separated regions to determine the free constants 
which correspond to the zero modes of the Laplacian. 

Non-Gaussian priors, which will be discussed in more detail in the next 
Section, often provide an alternative to the use of function hyperparameters. 
Similarly to Eq. (|515|) one may for example define a binary function B(x) in 
terms of 0, 

B(x) = (K(x)| 2 - \lu 2 (x)\ 2 - #) , (523) 
like, for a negative Laplacian prior, 



b(x) = e 



d{<f> - h 



dx 



d(<p - 1 2 



dx 



-■d 



(524) 



Here B(x) is directly determined by and is not considered as an independent 
hyperfield. Notice also that the functions Ui(x) and B(x) may be nonlocal 
with respect to <f)(x), meaning they may depend on more than one <fi{x) value. 
The threshold d has to be related to the prior expectations on cjj. A possible 
non-Gaussian prior for formulated in terms of B can be, 



■pU) oc e -5/ d:E (K( a; )l 2 ( 1 - B W)+l^2WI 2 BW-f^(-B))_ 



(525) 



with Nd{B) counting the number of discontinuities of B(x). Alternatively to 
one may for a real B define, similarly to ( |517| ), 



with 



c(x) = e(Mx)| 2 - V), 

lu b (x) = J dx' W B (x, x')[B(x') — t B (x') 



(526) 
(527) 
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and some filter operator Wg and template tg. Similarly to the introduction 
of hyperparameters, one can treat B(x) formally as an independent function 
by including a term A (B(x) — 9 (|co>i(:r)| 2 — |co> 2 (x)| 2 — $)) in the prior energy 
and taking the limit A — > oo. 

Eq. (|525|) looks similar to the combination of the prior (@98D with the 
hyperprior (|516|) , 

pUQ) OC e -l/^(l^iWI 2 ( 1 -- B 9W)+l^2(x)| 2 J B 9 (aO-fiV d (iJ e )--lnZ (/) (e))_ ^ 528 ^ 

Notice, however, that the definition ( |499| ) of the hyperfield B e (and N d (B e ) 
or Cq, respectively), is different from that of B (and Nd(B) or C), which are 
direct functions of <fi. If the Ui differ only in their templates, the normalization 



term can be skipped. Then, identifying Bg in ( 528 ) with a binary 9 and 
assuming i? = 0, $g = $b, W# = W#, the two equations are equivalent 
for 9 = 9 (|cj 1 (x)| 2 — lu^^) | 2 )- In the absence of hyperpriors, it is indeed 
easily seen that this is a selfconsistent solution for 9, given 0. In general, 
however, when hyperpriors are included, another solution for 9 may have a 
larger posterior. Non-Gaussian priors will be discussed in Section |6.5| . 

Hyperpriors or non-Gaussian prior terms are useful to enforce specific 
global constraints for 9(x) or B(x). In images, for example, discontinuities 
are expected to form closed curves. Hyperpriors, organizing discontinuities 



along lines or closed curves, are thus important for image segmentation [70 
ISO] EE B71 E33L E5T 



6 Non— Gaussian prior factors 

6.1 Mixtures of Gaussian prior factors 

Complex, non-Gaussian prior factors, for example being multimodal, may be 
constructed or approximated by using mixtures of simpler prior components. 
In particular, it is convenient to use as components or "building blocks" 
Gaussian densities, as then many useful results obtained for Gaussian pro- 
cesses survive the generalization to mixture models [|131| , |132| , |133| , |134| , |135 



We will therefore in the following discuss applications of mixtures of Gaus- 
sian priors. Gther implementations of non-Gaussian priors will be discussed 



in Section 6.5. 



In Section 5.1 we have seen that hyperparameters label components of 



mixture densities. Thus, if j labels the components of a mixture model, then 
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j can be seen as hyperparameter. In Section |S| we have treated the corre- 
sponding hyperparameter integration completely in saddle point approxima- 
tion. In this section we will assume the hyperparameters j to be discrete and 
try to calculate the corresponding summation exactly 

Hence, consider a discrete hyperparameter j, possibly in addition to con- 
tinuous hyperparameters 9. In contrast to the ^-integral we aim now in 
treating the analogous sum over j exactly, i.e., we want to study mixture 
models 

m m 

p(<P,9\D ) =5>(0,MA)) =Y,p(<!>\D O ,0J)p(9,j)- (529) 

3 j 

In the following we concentrate on mixtures of Gaussian specific priors. No- 
tice that such models do not correspond to Gaussian mixture models for <p 
as they are often used in density estimation. Indeed, the form of (f) may be 
completely unrestricted, it is only its prior or posterior density which is mod- 
eled by a mixture. We also remark that a strict asymptotical justification of 
a saddle point approximation would require the introduction of a parameter 

P so that p((f), 9\D ) oc e ^-'•»' Pj . If the sum is reduced to a single term, then 
$ corresponds to (3. 

We already discussed shortly in Section ^]2|that, in contrast to a product 
of probabilities or a sum of error terms implementing a probabilistic AND 
of approximation conditions, a sum over j implements a probabilistic OR. 
Those alternative approximation conditions will in the sequel be represented 
by alternative templates tj and covariances Kj. A prior (or posterior) den- 
sity in form of a probabilistic OR means that the optimal solution does not 
necessarily have to approximate all but possibly only one of the tj (in a met- 
ric defined by Kj). For example, we may expect in an image reconstruction 
task blue or brown eyes whereas a mixture between blue and brown might 
not be as likely. Prior mixture models are potentially useful for 

1. Ambiguous (prior) data. Alternative templates can for example repre- 
sent different expected trends for a time series. 

2. Model selection. Here templates represent alternative reference models 
(e.g., different neural network architectures, decision trees) and deter- 
mining the optimal 9 corresponds to training of such models. 

3. Expert knowledge. Assume a priori knowledge to be formulated in 
terms of conjunctions and disjunctions of simple components or build- 
ing blocks (for example verbally). E.g., an image of a face is expected 
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to contain certain constituents (eyes, mouth, nose; AND) appearing 
in various possible variants (OR). Representing the simple compo- 
nents/building blocks by Gaussian priors centered around a typical 
example (e.g., of an eye) results in Gaussian mixture models. This con- 
stitutes a possible interface between symbolic and statistical methods. 
Such an application of prior mixture models has some similarities with 



the quantification of "linguistic variables" by fuzzy methods [118, 119] 



For a discussion of possible applications of prior mixture models see also 
|131| , |132| , |133j , |134| , |135|| . An application of prior mixture models to image 



completion can be found in [|136|| . 

6.2 Prior mixtures for density estimation 

The mixture approach (|529|) leads in general to non-convex error functionals. 



For Gaussian components Eq. ( |529|) results in an error functional 



E e ^ = -QnP(0), N) + (P(<P), A x ) 

- In J2 e ~( * ( ^ <J (0) ' Kj {e) (< ^* J m+ln Z,p (0 ' j)+Ee >i ) , (530) 



j 

= -ln^Te"^-^^, (531) 

j 

where 

= -(lnP(0), iV) + (P(0), Ax) + i(^-*i(e), Ktf) tt-tjffl)), (532) 

and 

Cj = -\nZ^e,j). (533) 
The stationarity equations for and 9 

= >■:., >■:■,:■<:. (534) 

= e(^+^+z;v(^^))^-^ +c ^ (535) 
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can also be written 



= E-rrrt&MAO, (536) 



'" f dE AA OK 



= El^f + ^T + (537) 



Analogous equations are obtained for parameterized </>(£)• 
6.3 Prior mixtures for regression 

For regression it is especially useful to introduce an inverse temperature mul- 
tiplying the terms depending on 0, i.e., likelihood and prior .0 As in regression 
is represented by the regression function h(x) the temperature-dependent 
error functional becomes 

m m 

E e , h = - In ]T e-^~ E ^ +c i = - In ]T e~ E ^ +c \ (538) 
j j 

with 



Ej = E D + E j + E 9t pj, (539) 

))), 
(540) 



^ = \ (h - t D , K D (h - t D )) , E 0ij = \{h- tjie), KjiB) (h - tjiO))) 



some hyperprior energy Eg^j, and 



n 8 

Cj (9,f3) = -\nZ h {9,j,6) + -\n8-tLv D 



1 A _|_ r) /? 

- lndet (K^)) + ^ln/3 - |vb (541) 



with some constant c. If we also maximize with respect to (3 we have to 
include the (/i-independent) training data variance Vd = J27 Vi where Vi = 
Y^k y( x k) z '/ n i — t 2 D {xi) is the variance of the training data at Xi. In case 
every x^ appears only once Vd vanishes. Notice that Cj includes a contribution 
from the n data points arising from the /3-dependent normalization of the 

3 As also the likelihood term depends on (3 it may be considered part of a cf> together 
regression function h(x). Due to its similarity to a regularization factor we have included 
[3 in this chapter about hyperparameters. 
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likelihood term. Writing the stationarity equation for the hyperparameter (3 
separately, the corresponding three stationarity conditions are found as 

m 

= Y,( K D(h-t D ) + K j (h-t j ))e- (3E ^- Ee ^ +c ^ (542) 
j 

= Efe + ^ + ^fe-^lle-^-^^, (543) 



j 

m 



= E^ + ^ + ^Je-^-W^. (544) 

As (3 is only a one-dimensional parameter and its density can be quite non- 
Gaussian it is probably most times more informative to solve for varying 
values of (3 instead to restrict to a single 'optimal' (3* . Eq. fl542j) can also be 
written ^ 

h= (K D + f2 aj K-\ (K D t D + £ ajKjt^ , (545) 

with 

e ~Ej+Cj e -pE j- B 9ii3i3 -+i IndetKj 

% = P(j\h,0,(3,D ) 



(546) 



e -E k +c k e -f3E , k -E e ,t3,k+^lndetK k 

P (h\j, e, p, D ) P (j\e, g, Dp) j, g, goMi, e|/?, Dp 

p(h\6,(3,D ) " p(h,9\(3,D ) 

being thus still a nonlinear equation for /i. 

6.3.1 High and low temperature limits 

It are the limits of large and small f3 which make the introduction of this 
additional parameter useful. The reason being that the high temperature 
limit (3 — > gives the convex case, and statistical mechanics provides us with 
high and low temperature expansions. Hence, we study the high temperature 
and low temperature limits of Eq. (|545 ) . 



In the high temperature limit (3 — > the exponential factors aj become 
/i-independent 



a, t% a o = — _ ^ (547) 

£ m e~ Be " 3 ' fc+ 5 lndetKfc 
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In case one chooses Eg^j = Epj + (3E e one has to replace E e ^j by Epj. The 
high temperature solution becomes 

h = i (548) 
with (generalized) 'complete template average' 

(m \ ^ / rrt \ 

K D + ]T ajKjJ (K^ + £ ajK^-J . (549) 

Notice that i corresponds to the minimum of the quadratic functional 

m 

E {p=oo) = (h-t D , K D (h-t D ))+J2a°j(h-tj, Kj(h-tj)). (550) 

3 

Thus, in the infinite temperature limit a combination of quadratic priors by 
OR is effectively replaced by a combination by AND. 

In the low temperature limit j3 — > oo we have, assuming Eg^j = E@ + 
Ej + PEe, 



3 



^ e -P(E O:j * + E e) -E - Ej for ^ <jBoj , VjVj*, P(J*)^ 0,, (552) 
meaning that 

/^oo a oo = f 1 : J = argmin^oj = argmin^j 
J ' J \ : j 7^ argmin^oj = argmin^j 

Henceforth, all (generalized) 'component averages' tj become solutions 

h = ij, (554) 

with 

tj = (Kfl + K^ 1 (K D t D + Kjtj) , (555) 
provided the tj fulfill the stability condition 

E h ,j(h = tj)<E hJ ,(h = tj), Vf^j, (556) 
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i.e., 

Vj < - V, (K D + K,,) - f r j) + V f , Vj" + j, (557) 

where 

V$ = X - [(t D , K D t D ) + (tj, Kj tj) - (tj, (K D + Kj) tjU. (558) 

That means single components become solutions at zero temperature 1/(3 
in case their (generalized) 'template variance' Vj, measuring the discrepancy 
between data and prior term, is not too large. Eq. ( [545| ) for h can also be 
expressed by the (potential) low temperature solutions ij 

(m \ ^ m 

»,SK/> • Koj Y^nJKn ■ K /) I r (559) 

Summarizing, in the high temperature limit the stationarity equation 
( |542| ) becomes linear with a single solution being essentially a (generalized) 
average of all template functions. In the low temperature limit the sin- 
gle component solutions become stable provided their (generalized) variance 
corresponding to their minimal error is small enough. 



6.3.2 Equal covariances 

Especially interesting is the case of j-independent Kj(#) = K (6 ) ) and 6- 
independent det Ko(#). In that case the often difficult to obtain determinants 
of Kj do not have to be calculated. 

For j-independent covariances the high temperature solution is according 
to Eqs. (|549| . |555|) a linear combination of the (potential) low temperature 
solutions 

m 

t = £a°t,, (560) 

j 

It is worth to emphasize that, as the solution t is not a mixture of the 
component templates tj but of component solutions ij, even poor choices 
for the template functions tj can lead to good solutions, if enough data are 
available. That is indeed the reason why the most common choice to = for 
a Gaussian prior can be successful. 
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Eqs.( |559| ) simplifies to 



1 ■e~^ Eh -^~ Ee ^'i JrC i m m 
h = ^ m [- aEh Ah)-E„ n = E a fo =t + E( fl J - a i) ( 561 ) 



".I 

where 

and (for j -independent <i) 



f, = (K D + Kq)- 1 (K D t D + K t,) , (562) 



introducing vector a with components aj, m x m matrices 

= (4 - tj, (K D + K ) (f, - *"•)) (564) 

and constants 

dj = -pV 3 -E e ^, (565) 
with Vj given in (|558|) . Eq. ( |561| ) is still a nonlinear equation for h, it 



shows however that the solutions must be convex combinations of the h- 
independent tj. Thus, it is sufficient to solve Eq. ( |563| ) for m mixture coeffi- 
cients aj instead of Eq. ( p42j ) for the function h. 
The high temperature relation Eq. (|547 ) becomes 



or a® = 1/m for a hyperprior p(6,/3,j) uniform with respect to j. The low 
temperature relation Eq. (|553|) remains unchanged. 
For m = 2 Eq. ( |561| ) becomes 

h = J2 a jtj = 1+ 2 2 + («i - Q 2) ~ * 2 = - + 2 h + (tanh A) - - h , (567) 

i 

with (?! + ?2)/2 = t in case i^/y is uniform in j so that a° = 0.5, and 

2 ~ 2 2 

= -ti a (B 1 -B 2 )a + ^ Y ^ = ^ 6(2 0l - 1) + J -y^, (568) 
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Figure 7: The solution of stationary equation Eq. ( |563|) is given by the point 
where aie~2 ba2+d2 = (1 — ai)e~2 b ( 1 ~ ai } 2+dl (upper row) or, equivalently, a\ = 
2 (tanh A + 1) (lower row). Shown are, from left to right, a situation at high 
temperature and one stable solution (/3 = 2), at a temperature {(3 = 2.75) 
near the bifurcation, and at low temperature with two stable and one unstable 
solutions (3 = 4. The values of b = 2, d\ = —0.2025/3 and d 2 = —0.3025/3 used 
for the plots correspond for example to the one-dimensional model of Fig.^ 
with t\ = 1, t 2 = —1, to = 0.1. Notice, however, that the shown relation is 
valid for m = 2 at arbitrary dimension. 

because the matrices Bj are in this case zero except -Bi(2, 2) = B 2 (l,l) = b. 
The stationarity Eq. ( |563| ) can be solved graphically (see Figs.^, |8|), the 
solution being given by the point where aie~2 ba i +d2 = (1 — ai)e~ 2 b ^ ai ^ 2+dl ^ 
or, alternatively, 

ai = -(tanhA + 1) . (569) 

That equation is analogous to the celebrated mean field equation of the 
ferromagnet. 

We conclude that in the case of equal component covariances, in addition 
to the linear low-temperature equations, only am — 1-dimensional nonlinear 
equation has to be solved to determine the 'mixing coefficients' a\, • ■ ■ , a m -\. 

6.3.3 Analytical solution of mixture models 

For regression under a Gaussian mixture model the predictive density can be 
calculated analytically for fixed 9. This is done by expressing the predictive 
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Figure 8: As in Fig.[7] the plots of /i(ai) = a\ and /2(ai) = | (tanh A + 1) 
are shown within the inverse temperature range < (3 < 4. 

density in terms of the likelihood of 9 and j, marginalized over h 

p{y\x,D,D ) = ^ dhd6 — - p(y\x,D,D ,9,j). 

j •> EjJd6p{6,j)p{y D \xD,D ,6,j) 

(570) 

(Here we concentrate on 6*. The parameter /3 can be treated analogously.) 
According to Eq. ( [48 6| ) the likelihood can be written 



p(y D \x D ,D ,6,j) = e -^o, 3 -W+ilndet(AK, (0))) (5?1) 

with 

= - t s (ff), K,(9)(tD - t,(9))) = Vj, (572) 

and K.j(9) = (K^ 1 + KjJxdW) -1 being a n x n-matrix in data space. The 
equality of Vj and £o,j can be seen using Kj — Kj(Kd + Kj) _1 Kj = K D — 
K D (K D + K, iZ)Z) )- 1 K D = K A2J2J - K j>DD (K D + Kj^Kj-^ = K. For 
the predictive mean, being the optimal solution under squared-error loss 
and log-loss (restricted to Gaussian densities with fixed variance) we find 
therefore 

y(x) = Jdyyp(y\x,D,D ) = Y l Jd9b j (9)t j (6), (573) 

3 

with, according to Eq. ( pi7| ) . 

t j {0)=t j + Kj%{t I ,-t j ) } (574) 
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Figure 9: Shown is the joint posterior density of h and /3, i.e., p(h, (3\D, D ) 
oc p(yi)\h, (3)p(h\f3, D )p(/3) for a zero-dimensional example of a Gaussian 
prior mixture model with training data yr> = 0.1 and prior data yo = 
±1 and inverse temperature j3. L.h.s.: For uniform prior (middle) p{f3) oc 
1 with joint posterior p oc e~i h2+lnf3 ^e - ^'-' 1-1 ^ + e~^^ h+l ^ the maximum 
appears at finite (3. (Here no factor 1/2 appears in front of ln/3 because 
normalization constants for prior and likelihood term have to be included.) 
R.h.s.: For compensating hyperprior p([3) oc with p oc e~ '% h ' ''~ "f ( /i ~ 1 ) 2 + 

e~ f'' 2- ! ( h+1 ) 2 the maximum is at (3 = 0. 



and mixture coefficients 

p(P,j)p(yD\xD,D ,9,j) 



bj(6) = p(9,j\D) 



Ej / d9p(6,j)p(y D \x D , D , 6,j) 

which defines £y = /3i?o,j + Ee,j- For solvable ^-integral the coefficients can 
therefore be obtained exactly. 

If bj is calculated in saddle point approximation at 9 = 9* it has the 
structure of cij in (|546Q with Eqj replaced by Ej and Kj by Kj. (The inverse 
temperature j3 could be treated analogously to 9. In that case E$j would 
have to be replaced by Eg^j.) 

Calculating also the likelihood for j, 9 in Eq. (|575|) in saddle point ap- 
proximation, i.e., p(yo\xD,D ,9*,j) ps p(y D \xD,h*)p(h*\D ,9* , j), the terms 
p{Vd\xd, h*) in numerator and denominator cancel, so that, skipping Do and 
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Figure 10: Same zero-dimensional prior mixture model for uniform hyper- 
prior on (3 as in Fig.^j, but for varying data Xj = 0.3 (left), Xd = 0.5 (right). 



bjm = ? m^n = ai(h%n (576) 

becomes equal to the in Eq. (|546|) at h — h*. 

Eq. ( |575| ) yields as stationarity equation for 9, similarly to Eq. ( |488| ) 

For fixed # and j-independent covariances the high temperature solution 
is a mixture of component solutions weighted by their prior probability 

s t 3 y EpU)t j = T,°itj = t ( 579 ) 



134 



2 „ -A 6 8 10 12 14 




Figure 11: Exact b\ and a\ (dashed) vs. /3 for two mixture components with 
equal covariances and B\(2,2) = b = 2, E\ = 0.405, E2 = 0.605. 



The low temperature solution becomes the component solution tj with min- 
imal distance between data and prior template 



y 



tj* with j* — argmim-(t£) — tj, K.j(tn — tj)). 



(580) 



Fig.|TT] compares the exact mixture coefficient b\ with the dominant solution 
of the maximum posterior coefficient a\ (see also ||131|| ) which are related 
according to ( |563|) 



/3 



g 2 3 



aBja—Ej 



bjf 



(581) 



6.4 Local mixtures 

Global mixture components can be obtained by combining local mixture 
components. Predicting a time series, for example, one may allow to switch 
locally (in time) between two or more possible regimes, each corresponding 
to a different local covariance or template. 

The problem which arises when combining local alternatives is the fact 
that the total number of mixture components grows exponentially in the 
number local components which have to be combined for a global mixture 
component. 

Consider a local prior mixture model, similar to Eq. ( |525| ), 

p ((f)\e) = e -Sdx^{x-fi{x))\^~\nZ^e) ( 5g2 ) 
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where 9{x) may be a binary or an integer variable. The local mixture variable 
9{x) labels local alternatives for filtered differences u(x;9(x)) which may 
differ in their templates t(x;9(x)) and/or their local filters W(x;9(x)). To 
avoid infinite products, we choose a discretized x variable (which may include 
the y variable for general density estimation problems), so that 

= ^ pWe -E,M^))| 2 -lnZ, W; (5g3) 

e 

where the sum Y^e ls over a U local integer variables 0(x), i.e., 

E = E---E = (nzV (584) 

Only for factorizing hyperprior p{9) = I\ x p(9(x)) the complete posterior 
factorizes 

p(0) = ([]^ ] ]J ( p (0( a; ))e-M^))l 2 - ln ^(^))) 
\ x' e{x')J x 

= IIE (^( x )) e ~ kM(:E))|2 ~ lnZ ' M(:E)) ) ' ( 585 ) 



* e(x) 



because 



UJ2{e-^^ 2 ) =Y[Z^x,9(x)). (586) 

x 9(x) x 



Under that condition the mixture coefficients ag of Eq. (|546|) can be ob- 
tained from the equations, local in 8(x), 

ae = a 6 ( Xl )...e( Xl ) = p{0\<p) = Yl a e(x) (587) 

X 

with 

p(#(x))e~ |w(x;0(x))|2 ~ lnZ ^ ;9(:c)) 
a e(x) = p ^,^y^ {x -,e> { x))\*-inz,{x-,e>(x)) ■ ( 588 ) 

For equal covariances this is a nonlinear equation within a space of dimension 
equal to the number of local components. For non-factorizing hyperprior the 
equations for different 9(x) cannot be decoupled. 
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6.5 Non— quadratic potentials 



Solving learning problems numerically by discretizing the x and y variables 
allows in principle to deal with arbitrary non-Gaussian priors. Compared to 
Gaussian priors, however, the resulting stationarity equations are intrinsically 
nonlinear. 

As a typical example let us formulate a prior in terms of nonlinear and 
non-quadratic "potential" functions ip acting on "filtered differences" uo = 
W(0 — t), defined with respect to some positive (semi-) definite inverse co- 
variance K = W T W. In particular, consider a prior factor of the following 
form 



P(<f>) 



■ J dx il>(u){x))— In 



-Eft) 



(589) 



where E((p) = Jdxip{^(x)). For general density estimation problems we 
understand x to stand for a pair (x, y). Such priors are for example used for 
image restoration |70|, |28], |16l| |43|, |242 



For differentiable ip function the functional derivative with respect to 4>(x) 
becomes 



&<Kx)P(<t>) = - e -/ da; '^'))- ln ^ Jdx"^'{uj(x"))W(x",x), 
with ip'(s) = dip(z)/dz, from which follows 

5^E{(j)) = -5 lnp(0) = WV- 



(590) 



(591) 



For nonlinear filters acting on — t, W in Eq. ( |589| ) must be replaced by 
uj'{x) = S^^uii^x). Instead of one W a "filter bank" W a with corresponding 
K a , u a , and ip a may be used, so that 



and 



■ J2 a J dx i}a(^a{x))~\nZ^ 



(592) 



(593) 



The potential functions ip may be fixed in advance for a given problem. 
Typical choices to allow discontinuities are symmetric "cup" functions with 
minimum at zero and flat tails for which one large step is cheaper than many 
small ones [ 233|| ). Examples are shown in Fig. [12| (a,b). The cusp in (b), 



where the derivative does not exist, requires special treatment [242]. Such 



137 



functions can also be interpreted in the sense of robust statistics as flat tails 
reduce the sensitivity with respect to outliers ||100( , [L01| , |67| , p6| . 



Inverted "cup" functions, like those shown in Fig. [12] (c), have been ob- 
tained by optimizing a set of ip a with respect to a sample of natural images 
242|| . (For statistics of natural images their relation to wavelet-like filters 



and sparse coding see also ||172 



While, for W which are differential operators, cup functions promote 
smoothness, inverse cup functions can be used to implement structure. For 
such W the gradient algorithm for minimizing E((p), 

0new = 0old ~ 77<^-E(0old) , (594) 

becomes in the continuum limit a nonlinear parabolic partial differential 
equation, 

T = -]TW;^(W Q (0-t)). (595) 

a 

Here a formal time variable r have been introduced so that (0 new — <Poid)/v — * 
4> T = dcj) I dr. For cup functions this equation is of diffusion type |[L70| , |183|| , if 
also inverted cup functions are included the equation is of reaction-diffusion 
type [ 242|| . Such equations are known to generate a great variety of patterns. 



Alternatively to fixing ip in advance or, which is sometimes possible for 
low-dimensional discrete function spaces like images, to approximate ip by 
sampling from the prior distribution, one may also introduce hyperparame- 
ters and adapt potentials ip(9) to the data. 

For example, attempting to adapt a unrestricted function ip(x) with hy- 
perprior p(ip) by Maximum A Posteriori Approximation one has to solve the 
stationarity condition 

= <^ (s ) \np(4>, if;) = <^ (s ) lnp(</>|V>) + <V( S ) lnp(^). (596) 

From ^ 

ti<Ks)P(<l>\if>) = -P{<t>\^) dx5{s- u{x)) - —^^(s)^, (597) 

J z 

it follows 

- <^( s) \np(<j)\i/j) = n(s)- < n(s) >, (598) 

with integer 

n(s) = (dx5(s-uj(x)), (599) 
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(b) 




Figure 12: Non-quadratic potentials of the form ip(x) = a(1.0 — 1/(1 + (\x — 
x0|/6) 7 )), l242j : "Diffusion terms": (a) Winkler's cup function (a= 5, 
b = 10, 7 = 0.7, xo = 0), (b) with cusp (a= 1, b — 3, 7 = 2, xq = 0), (c) 
"Reaction term" (a = -4.8, 6 = 15, 7 = 2.0 x = 0). 
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being the histogram of the filtered differences, and average histogram 



<n(s)> = Jd(j)p{4)\ijj)n{s). (600) 

The right hand side of Eq. (|598|) is zero at 0* if, e.g., p(0|VO — 8(<f> ~ 4>*)i 
which is the case for ip(uj(x; 0)) = (3 (uj(x; (f>) — u(x; 4>*)) 2 in the (3 — > oo limit. 

Introducing hyperparameters one has to keep in mind that the resulting 
additional flexibility must be balanced by the number of training data and 
the hyperprior to be useful in practice. 



7 Iteration procedures: Learning 

7.1 Numerical solution of stationarity equations 

Due to the presence of the logarithmic data term — (InP, N) and the normal- 
ization constraint in density estimation problems the stationary equations 
are in general nonlinear, even for Gaussian specific priors. An exception are 
Gaussian regression problems discussed in Section |3.7| for which — (InP, N) 
becomes quadratic and the normalization constraint can be skipped. How- 
ever, the nonlinearities arising from the data term — (InP, N) are restricted 
to a finite number of training data points and for Gaussian specific priors one 
may expect them, like those arising from the normalization constraint, to be 
numerically not very harmful. Clearly, severe nonlinearities can appear for 
general non-Gaussian specific priors or general nonlinear parameterizations 
I'iO- 

As nonlinear equations the stationarity conditions have in general to be 
solved by iteration. In the context of empirical learning iteration procedures 
to minimize an error functional represent possible learning algorithms. 

In the previous sections we have encountered stationarity equations 

= = (6 oi) 

for error functionals E^, e.g., <fi = L or <fi = P, written in a form 

K(j) = T. (602) 



with ^-dependent T (and possibly K). For the stationarity Eqs. (|143|) , (|171|) 



and (|192|) the operator K is a ^-independent inverse covariance of a Gaussian 
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specific prior. It has already been mentioned that for existing (and not too 
ill-conditioned) K 1 (representing the covariance of the prior process) Eq. 
( |602| ) suggests an iteration scheme 



= k- 1 T(0 w ), (603) 



for discretized starting from some initial guess (fr- In general, like for the 
non-Gaussian specific priors discussed in Section [| K can be (^-dependent. 
Eq. ( |358| ) shows that general nonlinear parameterizations P(£) lead to non- 
linear operators K. 

Clearly, if allowing ^-dependent T, the form (|602 ) is no restriction of 



generality. One always can choose an arbitrary invertible (and not too ill- 
conditioned) A, define 

T A = G(<j>) + A0, (604) 
write a stationarity equation as 

A0 =T A , (605) 

discretize and iterate with A -1 . To obtain a numerical iteration scheme we 
will choose a linear, positive definite learning matrix A. The learning matrix 
may depend on and may also change during iteration. 

To connect a stationarity equation given in form ( |602j ) to an arbitrary 
iteration scheme with a learning matrix A we define 

B = K — A, B„ = K--A, (606) 

V 

i.e., we split K into two parts 

K = A + B = -A + B„, (607) 
V 

where we introduced r\ for later convenience. Then we obtain from the sta- 
tionarity equation (|602|) 

ct) = r]A- 1 {T -B v </>). (608) 

To iterate we start by inserting an approximate solution 0^ to the right- 
hand side and obtain a new 0^ +1 ^ by calculating the left hand side. This can 
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be written in one of the following equivalent forms 



= 77A- 1 (T« - B„0«) (609) 
= (1 - r/)0 (l) + r]A- 1 (t {i) - B0 W ) (610) 



= W + r^A- 1 (T (i) - K0 W ) , (611) 

where rj plays the role of a learning rate or step width, and A -1 = ^A^^ 
may be iteration dependent. The update equations ( |609| - |611| ) can be written 

A0 W = rjA^G^), (612) 

with A0« = - Eq. (§TJ does not require the calculation of B 

or B,, so that instead of A directly A -1 can be given without the need to 
calculate its inverse. For example operators approximating K _1 and being 
easy to calculate may be good choices for A -1 . 

For positive definite A (and thus also positive definite inverse) conver- 
gence can be guaranteed, at least theoretically. Indeed, multiplying with 
(1/77) A and projecting onto an infinitesimal dip gives 



-(d<P, AA0)=(#,^) 

f] v 0(f) 



d(-E^). (613) 



In an infinitesimal neighborhood of where A0^ becomes equal to d<p 
in first order the left-hand side is for positive (semi) definite A larger (or 
equal) to zero. This shows that at least for r] small enough the posterior 
log-probability — increases i.e., the differential dE^ is smaller or equal to 
zero and the value of the error functional decreases. 

Stationarity equation ( |127| ) for minimizing El yields for ( |609| , |610| . |611| ). 

= ^(n-A^^® -KL® + ~AL®\ (614) 

= (1 - r/)L (i) + 7/A- 1 (N - Af e iW - KL (i) + AL (i) ) (615) 
= L«+7 7 A- 1 (iV-Afe iW -K J L»). (616) 



The function is also unknown and is part of the variables we want to solve 
for. The normalization conditions provide the necessary additional equations, 
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and the matrix A 1 can be extended to include the iteration procedure for 
Ax- For example, we can insert the stationarity equation for Ax in (|616|) to 

get 

L (i+i) = L (i) + vA -i ^ N _ e LW ( Nx _ i x KL) - KL W ] . (617) 

If normalizing ifi at each iteration this corresponds to an iteration procedure 
for g = L + In Zx ■ 

Similarly, for the functional Ep we have to solve (|165|) and obtain for 



pd+i) = p(i) + vA -i r T w _ K p® 



(618) 

P w + V A- 1 (p (irl iV - A? - KP W ) (619) 
P (i) + riA- 1 (P (irl A^ -N x - IxP (i) KP (i) - KP W ) . (620) 



Again, normalizing P at each iteration this is equivalent to solving for z 
ZxP, and the update procedure for Ax can be varied. 



7.2 Learning matrices 

7.2.1 Learning algorithms for density estimation 

There exists a variety of well developed numerical methods for unconstraint 
as well as for constraint optimization ||185| , [57], |88|, |191| , [8!^ , p] , pL9|, |80| , |188 



Popular examples include conjugate gradient, Newton, and quasi-Newton 
methods, like the variable metric methods DFP (Davidon-Fletcher-Powell) 
or BFGS (Broyden-Fletcher-Goldfarb-Shanno). 

All of them correspond to the choice of specific, often iteration dependent, 
learning matrices A defining the learning algorithm. Possible simple choices 
are: 



A = I 
A = D 
A = L + D 
A = K 



Gradient 
Jacobi 

Gauss-Seidel 
prior relaxation 



(621) 
(622) 
(623) 
(624) 



where I stands for the identity operator, D for a diagonal matrix, e.g. the 
diagonal part of K, and L for a lower triangular matrix, e.g. the lower tri- 
angular part of K. In case K represents the operator of the prior term in 
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an error functional we will call iteration with K 1 (corresponding to the co- 
variance of the prior process) prior relaxation. For 0-independent K and T, 
r] — 1 with invertible K the corresponding linear equation is solved by prior 
relaxation in one step. However, also linear equations are solved by iteration 
if the size of K is too large to be inverted. Because of I -1 = I the gradient 
algorithm does not require inversion. 

On one hand, density estimation is a rather general problem requiring the 
solution of constraint, inhomogeneous, nonlinear (integro-) differential equa- 
tions. On the other hand, density estimation problems are, at least for Gaus- 
sian specific priors and non restricting parameterization, typically "nearly" 
linear and have only a relatively simple non-negativity and normalization 
constraint. Furthermore, the inhomogeneities are commonly restricted to a 
finite number of discrete training data points. Thus, we expect the inversion 
of K to be the essential part of the solution for density estimation problems. 
However, K is not necessarily invertible or may be difficult to calculate. 
Also, inversion of K is not exactly what is optimal and there are improved 
methods. Thus, we will discuss in the following basic optimization methods 
adapted especially to the situation of density estimation. 

7.2.2 Linearization and Newton algorithm 

For linear equations K</> = T where T and K are no functions of a spectral 
radius p(M) < 1 (the largest modulus of the eigenvalues) of the iteration 
matrix 

M = -r]A- l B, n = (1 - 77)1 - rjA^B = I - rjA^K (625) 

would guarantee convergence of the iteration scheme. This is easily seen by 
solving the linear equation by iteration according to ( |609| ) 

(m) = rjA^T + M0 W (626) 

= tiA^T + 7 ] MA- 1 T + M 2 (j) {l - 1) (627) 

00 

= 77 £ M n A _1 T. ( 628 ) 

n=0 

A zero mode of K, for example a constant function for differential operators 
without boundary conditions, corresponds to an eigenvalue 1 of M and would 
lead to divergence of the sequence 0^. However, a nonlinear T{<p) or K(0), 
like the nonlinear normalization constraint contained in T(<f)), can then still 
lead to a unique solution. 
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A convergence analysis for nonlinear equations can be done in a linear 
approximation around a fixed point. Expanding the gradient at 0* 



G{<f>) 



S(-E 4 



+ 



H( 



+ 



(629) 



shows that the factor of the linear term is the Hessian. Thus in the vicinity 
of 4>* the spectral radius of the iteration matrix 



M = I + t ? A 1 H, 



(630) 



determines the rate of convergence. The Newton algorithm uses the negative 
Hessian — H as learning matrix provided it exists and is positive definite. 
Otherwise it must resort to other methods. In the linear approximation (i.e., 
for quadratic energy) the Newton algorithm 



A 



H 



Newton 



(631) 



is optimal. We have already seen in Sections |3.1.3| and |3.2.3| that the inho- 
mogeneities generate in the Hessian in addition to K a diagonal part which 
can remove zero modes of K. 



7.2.3 Massive relaxation 

We now consider methods to construct a positive definite or at least invertible 
learning matrix. For example, far from a minimum the Hessian H may not 
be positive definite and like a differential operator K with zero modes, not 
even invertible. Massive relaxation can transform a non-invertible or not 
positive definite operator A , e.g. A = K or A = — H, into an invertible 
or positive definite operators: 

A = A + m 2 I : Massive relaxation (632) 

A generalization would be to allow m = m(x,y). This is, for example, used 
in some realizations of Newton's method for minimization in regions where 
H is not positive definite and a diagonal operator is added to — H, using for 



example a modified Cholesky factorization |L9| . The mass term removes the 
zero modes of K if — m 2 is not in the spectrum of Ao and makes it positive 
definite if m 2 is larger than the smallest eigenvalue of A . Matrix elements 
( 0, (A — zl)^ 1 cf)) of the resolvent A~ 1 (z), z = —m 2 representing in this 
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complex variable, have poles at discrete eigenvalues of A and a cut 
at the continuous spectrum as long as has a non-zero overlap with the 
corresponding eigenfunctions. Instead of multiples of the identity, also other 
operators may be added to remove zero modes. The Hessian in ( |156| ), for 
example, adds a x-dependent mass-like, but not necessarily positive definite 
term to K. Similarly, for example Hp in ( |181| ) has (x, y)-dependent mass 
P~ 2 N restricted to data points. 

While full relaxation is the massless limit m 2 — > of massive relaxation, 
a gradient algorithm with rj' can be obtained as infinite mass limit m 2 — > oo 
with 77 — > 00 and m 2 /rj = l/i]'. 

Constant functions are typical zero modes, i.e., eigenfunctions with zero 
eigenvalue, for differential operators with periodic boundary conditions. For 
instance for a common smoothness term —A (kinetic energy operator) as 
regularizing operator K the inverse of A = K + m 2 I has the form 

A-\x',y';x,y)= -i 2 - (633) 
—A + m z 

rl d xh rl d Yh pik x {x-x')+ik y (y-y') 

, (27i) d kl + kl + m? ' 1 J 

with d = dx + dy, dx = dim(X), dy = dim(y). This Green's function or 
matrix element of the resolvent kernel for Ao is analogous to the (Euclidean) 
propagator of a free scalar field with mass m, which is its two-point corre- 
lation function or matrix element of the covariance operator. According to 
1/x = f£°dte~ xt the denominator can be brought into the exponent by intro- 
ducing an additional integral. Performing the resulting Gaussian integration 
over k = (k x ,k y ) the inverse can be expressed as 

A _1 (x', y'\ x, y; m) = m d_2 A _1 (m(x — x'),m(y — y')\ 1) 



= (27r)- d / 2 (- j^- -] K {d „ 2)/2 (m\x-x'\+m\y-y'\), (635) 

\\x — x'\ + \y — y\ J 

in terms of the modified Bessel functions K v (x) which have the following 
integral representation 



(d-2)/2 




K v {2J{3 1 ) = -[±\ / dtf^et-v. (636) 
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Alternatively, the same result can be obtained by switching to ^-dimensional 
spherical coordinates, expanding the exponential in ultra-spheric harmonic 
functions and performing the integration over the angle- variables |[L17|| . For 
the example d — 1 this corresponds to Parzens kernel used in density esti- 
mation or for d = 3 

A-~ 1 (x',y';x,y) = — -m|,-*'|-H^'l (637) 

A7t\x — x I + 4:7r\y — y \ 

The Green's function for periodic, Neumann, or Dirichlet boundary con- 
ditions can be expressed by sums over A~ 1 (x', y'\ x, y) JTTJ. 

The lattice version of the Laplacian with lattice spacing a reads 

A /H = \ E[/(" - a j) - 2/(n) + f(n + a,)], (638) 

3 

writing cij for a vector in direction j and length a. Including a mass term we 
get as lattice approximation for A 

1 d x 

A(n x ,n y ;m x ,m y ) = -^£^(^-2^ + ^) 

a i=i 

2 ^""^ 3n x ,m x {$n y +a v -,my ^n y ,m y ~\~ $n y -a v . ,m y ) m ^n x ,m x ^n y ,m y (639) 
° 3=1 

Inserting the Fourier representation ( |101| ) of 8{x) gives 

A(n x ,n v ;m x ,m v ) = ^ f ^ e 'fa(">-"'^M%-%) 
V ' y ' ' W a 2 i-Tr 2vr d 



m 2 a 2 1 , 1 dY 



x ( 1 + - - X>os/^ - - Y/xxkyj | , (640) 



with k X) i = k x a°?, cos k y j = cos k y Oj and inverse 



2 r7T Jdxl. A^Yh 

j Li rv x tX/y 



ik x {n x —m x )+ik y (n y —m y ) 



2dJ-« (2ttY 1+ 2^ - 1 cos ^ - i i cos A;, 
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(641) 



(For m = and d < 2 the integrand diverges for k — > (infrared divergence). 
Subtracting formally the also infinite A _1 (0, 0; 0, 0) results in finite difference. 
For example in d = 1 one finds A~ 1 (n y ; m y ) — A -1 (0; 0) = — (l/2)\n y — m y \ 
p3| . Using l/x = J™dte~ xt one obtains ]T90[ 



A-\k x ,k y ) = 1 r^ e -^- 2 <E^co S ^, + Efco Sfca 

2 «/ o 



(642) 



with fj, = d/a 2 + m 2 /2. This allows to express the inverse A -1 in terms of 
the modified Bessel functions I v {n) which have for integer argument n the 
integral representation 

I v (n) = - rdQe ncose cos{uQ). (643) 

7T JO 

One finds 

A- 1 {n x1 n y -m x ,m y ) = - / JJ K \n Xti -n' x ^(t/a 2 ) II K \rn v j-m'.\( t / a2 )- 

/ J0 i=l j=l 

(644) 

It might be interesting to remark that the matrix elements of the inverse 
learning matrix or free massive propagator on the lattice A -1 (a/, y'\ x, y) can 
be given an interpretation in terms of (random) walks connecting the two 
points (x',y') and (x, y) |56], |190|| . For that purpose the lattice Laplacian is 
splitted into a diagonal and a nearest neighbor part 

- A = -4 (2tfl - W) , (645) 
a 2 

where the nearest neighbor matrix W has matrix elements equal one for 
nearest neighbors and equal to zero otherwise. Thus, 

(_A + m 2 ) _1 = — (i_J_wl 1 = —y (-^-Yw n , (646) 

can be written as geometric series. The matrix elements W n (a/, y'\ x, y) give 
the number of walks w[(x' , y') — > (x, y)\ of length |iy| = n connecting the two 
points {x',y') and (x,y). Thus, one can write 

(-A + m 2 )" tf,i/;x,v) = — £ (— 2 ) . (647) 
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7.2.4 Gaussian relaxation 

As Gaussian kernels are often used in density estimation and also in function 



approximation (e.g. for radial basis functions ||186|| ) we consider the example 



oo x / M 2\* 



A = ^2 — — - = : Gaussian (648) 

k =o k - \ 2<T / 

with positive semi-definite M 2 . The contribution for k = corresponds to 
a mass term so for positive semi-definite M this A is positive definite and 
therefore invertible with inverse 

A _1 = e"^, (649) 

which is diagonal and Gaussian in M-representation. In the limit a — > oo or 
for zero modes of M the Gaussian A -1 becomes the identity I, corresponding 
to the gradient algorithm. Consider 

M 2 (x', y'; x, y) = -5(x - x')5{y - y')A (650) 

where the ^-functions are usually skipped from the notation, and 

. d 2 d 2 
A = — + 



dx 2 dy 2 ' 

denotes the Laplacian. The kernel of the inverse is diagonal in Fourier rep- 
resentation 

A(k' x , k' y , , k x , k y ) = 5(k x - k' x )5(k y - k' y )e~ k -^ (651) 
and non-diagonal, but also Gaussian in (x, ^-representation 

A~ 1 (x',y';x,y) = e~& = e ~%W- (652) 

J {2TT) d 

= ( -i-Ve-^K— 'THv-v>?) = 1 _ e - ( - ,) >-" ,)a > (6 53) 



2tt 



with a = 1/a and d = dx + dy, dx = dim(X), dy = dim(F). 
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7.2.5 Inverting in subspaces 

Matrices considered as learning matrix have to be invertible. Non-invertible 
matrices can only be inverted in the subspace which is the complement of 
its zero space. With respect to a symmetric A we define the projector Qo = 
I — J2iipJ ' ipi m to its zero space (for the more general case of a normal A 
replace ipf by the hermitian conjugate ipj) and its complement Qi = I — Q = 
J2i ipi with ipi denoting orthogonal eigenvectors with eigenvalues ai ^ of 
A, i.e., Aipi = diipi 7^ 0. Then, denoting projected sub-matrices by QjAQj 
= Aij we have A o = Ai = A i = 0, i.e., 

A = QiAQx = An. (654) 

and in the update equation 

AA0 w = V G (655) 

only An can be inverted. Writing Qj(p = (fij for a projected vector, the 
iteration scheme acquires the form 

A0f } = tjA^Gi, (656) 
= r]G . (657) 

For positive semi-definite A the sub-matrix An is positive definite. If the 
second equation is already fulfilled or its solution is postponed to a later 
iteration step we have 

0i +1) = 0«+ V A£ (Tf - K<M° " kM ] ) , (658) 

= 0?- (659) 

In case the projector Q = Io is diagonal in the chosen representation the 
projected equation can directly be solved by skipping the corresponding com- 
ponents. Otherwise one can use the Moore-Penrose inverse A* of A to solve 
the projected equation 

A0 W = r]A*G. (660) 

Alternatively, an invertible operator A 00 can be added to An to obtain a 
complete iteration scheme with A -1 = A n x + Aqq 1 

(m) = ^)+^ A - 1 (T«-KM ) -KSiJrf ) ) 

+77A- 1 (Tf - Kg0« - KM ] ) ■ (661) 
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The choice A -1 = (A n + loo) -1 = A-u + loo, = A-n + Qo, for instance, 
results in a gradient algorithm on the zero space with additional coupling 
between the two subspaces. 

7.2.6 Boundary conditions 

For a differential operator invertability can be achieved by adding an operator 
restricted to a subset B C X x Y (boundary). More general, we consider an 
projector Q B on a space which we will call boundary and the projector on 
the interior Q/ = I — Q B . We write = K. ki for k,l e {I,B}, and 

require Kg/ = 0. That means K is not symmetric, but K// can be, and we 
have 

K = (I - Q B )K + Q B KQ B = K H + K IB + K BB . (662) 
For such an K an equation of the form K0 = T can be decomposed into 

K BB <f) B = T B , (663) 
KjB&B + Knfa = Tj, (664) 

with projected <pk = Qk<fi, Tk = QfcT, so that 

<j) B = K BB T B , (665) 
0, = KJ/ (T, - K IB K BB T B ) . (666) 

The boundary part is independent of the interior, however, the interior can 
depend on the boundary. A basis can be chosen so that the projector onto 
the boundary is diagonal, i.e., 

Qb = Ib= E ® <%)) ® (6(xj) ® 5(%)) T . 

j:(xj,yj)eB 

Eliminating the boundary results in an equation for the interior with adapted 
inhomogeneity. The special case K BB = 1 BB , i.e., <p B = T B on the boundary, 
is known as Dirichlet boundary conditions. 

As trivial example of an equation K0 = T with boundary conditions, 
consider a one-dimensional finite difference approximation for a negative 
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Laplacian K, adapted to include boundary conditions as in Eq. 
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Then Eq. (|663|) is equivalent to the boundary conditions, 
the interior equation Eq. ( |664| ) reads 
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(667) 



b, and 



(668) 



(Useful references dealing with the numerical solution of partial differential 
equations are, for example, 
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Similarly to boundary conditions for K, we may use a learning matrix A 
with boundary conditions (corresponding for example to those used for K): 



A = A// + A/b + A BB 
A = A n + A IB + I B b 



Boundary 
Dirichlet boundary 



(669) 
(670) 



For linear A bb the form ( |669| ) corresponds to general linear boundary condi- 
tions. (It is also possible to require nonlinear boundary conditions.) An can 
be chosen symmetric, and therefore positive definite, and the boundary of A 
can be changed during iteration. Solving A(0( J+1 ) — 0W) = r){T^> — KW0w) 
gives on the boundary and for the interior 



'b 



+ V&BB {T } 

>', - //AT, Hf-K,, 



B 



-1 frp(i) If® A*) _ J£ 



KM 

(«) AV 



K 



B/V 9 / 



(671) 



1BVB 



(,(*+!) 
B 



-(i) 



r£° and K 



b 



For fulfilled boundary conditions with 0^ = 

for ^A^b — >■ so the boundary is not updated, the term 0^ 
Otherwise, inserting the first in the second equation gives 



(<) 



Bi 



(672) 
= 0, or 



(*+•!) 

~ VB 



>(*+!) 



4 + ^ A 7/ ] 
-17A7/A 



(0 



K 



(<U(0 



K (i) 



vanishes. 



(673) 



IB^BB \ T B 



(0 
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w ^0 

BBVB 
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Even if K is not defined with boundary conditions, an invertible A can be 
obtained from K by introducing a boundary for A. The updating process 
is then restricted to the interior. In such cases the boundary should be 
systematically changed during iteration. Block-wise updating of <fi represent 
a special case of such learning matrices with variable boundary. 

The following table summarizes the learning matrices we have discussed 
in some detail for the setting of density estimation (for conjugate gradient 



and quasi-Newton methods see, for example, ||191]| ): 



Learning algorithm 


Learning matrix 


Gradient 
Jacobi 
Gauss-Seidel 
Newton 
prior relaxation 
massive relaxation 
linear boundary 
Dirichlet boundary 

Gaussian 


A = I 
A = D 
A = L + D 
A = -H 
A = K 

A = A + m 2 I 

A = A// + A IB + A BB 

A = A H + Aib + Ibb 

a _ V oo 1 (M 2 \ k _ j£ 
A-L fe= 0fei(2^J 



7.3 Initial configurations and kernel methods 
7.3.1 Truncated equations 

To solve the nonlinear Eq. fl603j ) by iteration one has to begin with an ini- 
tial configuration <^>. In principle any easy to use technique for density 
estimation could be chosen to construct starting guesses <j>( ' . 

One possibility to obtain initial guesses is to neglect some terms of the full 
stationarity equation and solve the resulting simpler (ideally linear) equation 
first. The corresponding solution may be taken as initial guess for solving 
the full equation. 

Typical error functionals for statistical learning problems include a term 
(L, N) consisting of a discrete sum over a finite number n of training data. 
For diagonal P' those contributions result (|345|) in n 5-peak contributions to 
the inhomogeneities T of the stationarity equations, like J2i 8(x — Xi)5(y — tji) 
in Eq. ( |143| ) or J2i $( x ~ x i)^(y — Hi) I P{ x i y) i n Eq. (|171| ). To find an initial 
guess, one can now keep only that 5-peak contributions T$ arising from the 
training data and ignore the other, typically continuous parts of T. For ( |143| ) 
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and ( |171| ) this means setting Ax = and yields a truncated equation 

K(j) = P'P" 1 ^ = T s . (674) 
Hence, can for diagonal P' be written as a sum of n terms 

0(g»y) = J2C(x,y;x i ,y i ) P J' X " y ^ (675) 

i=l r\Xi,yi) 

with C = K _1 , provided the inverse K -1 exists. For El the resulting trun- 
cated equation is linear in L. For Ep, however, the truncated equations 
remains nonlinear. Having solved the truncated equation we restore the 
necessary constraints for 0, like normalization and non-negativity for P or 
normalization of the exponential for L. 

In general, a C ^ K 1 can be chosen. This is necessary if K is not 
invertible and can also be useful if its inverse is difficult to calculate. One 
possible choice for the kernel is the inverse negative Hessian C = — H 1 
evaluated at some initial configuration or an approximation of it. A 
simple possibility to construct an invertible operator from a noninvertible K 
would be to add a mass term 

C = (K + m^l) _1 , (676) 

or to impose additional boundary conditions. 

Solving a truncated equation of the form ( |675|) with C means skipping 
the term — C(P'Ax + (K — C~ x )0) from the exact relation 

= CP'P ^ - C(P'Ax + (K - Cr 1 )^). (677) 

A kernel used to create an initial guess will be called an initializing 
kernel. 

A similar possibility is to start with an "empirical solution" 



y emp; 



(678) 



where em p is defined as a which reproduces the conditional empirical 
density P emp of Eq. ( |235| ) obtained from the training data, i.e., 

Pcmp = P(0cmp). (679) 

In case, there are not data points for every x-value, a correctly normalized 
initial solution would for example be given by -P emp defined in Eq. ( 237 ). If 
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zero values of the empirical density correspond to infinite values for 0, like 
in the case <f> = L, one can use P* mp as defined in Eq. fl238|) , with small e, to 
obtain an initial guess. 

Similarly to Eq. (|675 ), it is often also useful to choose a (for example 
smoothing) kernel C and use as initial guess 



C0 emp , (680) 



or a properly normalized version thereof. Alternatively, one may also let the 
(smoothing) operator C directly act on P emp and use a corresponding </> as 
initial guess, 

0W = (0)(-DcP emp ), (681) 

assuming an inverse (0) < - -1 ^ of the mapping P(<f>) exists. 

We will now discuss the cases 6 = L and 6 = P in some more detail. 



7.3.2 Kernels for L 

For El we have the truncated equation 

L = CN. (682) 
Normalizing the exponential of the solution gives 

n . 

L(x, y) = J £ C(x, y; x u Vi ) - In / dy' c ^'^, (683) 

or 

L = CN — \nl x e CN . (684) 

Notice that normalizing L according to Eq. ( p83| ) after each iteration the 
truncated equation ( 682 ) is equivalent to a one-step iteration with uniform 
p(o) _ e L accorc li n g to 

L 1 = CiV + CP (0) Ax, (685) 

where only (I — CK)L is missing from the nontruncated equation ( J677 ), 
because the additional y-independent term CP^Ax becomes inessential if 
L is normalized afterwards. 

Lets us consider as example the choice C = — H~ 1 (0^) for uniform initial 
L^ ) = c corresponding to a normalized P and KL^ ^ = (e.g., a differential 
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operator). Uniform L^ ) means uniform P^ = l/v y , assuming that v y = Jdy 
exists and, according to Eq. (|137|) , Ax = N x for KL® = 0. Thus, the 
Hessian ( |160| ) at L^ )) is found as 



H(LW) = -[l-k]K(l-^)-(l-^)^ = -CT\ (686) 



Vy J Y Vy J y Vy J Vy 



which can be invertible due to the presence of the second term. 

Another possibility is to start with an approximate empirical log-density, 
defined as 

with P e e mp given in Eq. ( |238|) . Analogously to Eq. (|680|) , the empirical log- 



density may for example also be smoothed and correctly normalized again, 
resulting in an initial guess, 

L (0) = CL e emp - In I x e CL -p . (688) 

Similarly, one may let a kernel C, or its normalized version C defined below 
in Eq. (|692|) , act on P em p first and then take the logarithm 

L<°> = ln(CP^ p ). (689) 

Because already CP emp is typically nonzero it is most times not necessary to 
work here with P^ mp . Like in the next section P emp may be also be replaced 
by -Pemp as defined in Eq. ( [237] ). 

7.3.3 Kernels for P 

For Ep the truncated equation 

P = CP _1 iV, (690) 

is still nonlinear in P. If we solve this equation approximately by a one- 
step iteration P 1 = C(P^)~ 1 N starting from a uniform initial P^ and 
normalizing afterwards this corresponds for a single x-value to the classical 
kernel methods commonly used in density estimation. As normalized density 
results 
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i.e., 

P = N K ] X CN = CN, (692) 

with (data dependent) normalized kernel C = N^C and lSc,x the diagonal 
matrix with diagonal elements IxCN. Again C = K 1 or similar invertible 
choices can be used to obtain a starting guess for P. The form of the Hessian 
(|181|) suggests in particular to include a mass term on the data. 

It would be interesting to interpret Eq. (|692|) as stationarity equation of 



a functional Ep containing the usual data term ^2 i \nP(xi, Vi)- Therefore, to 
obtain the derivative P~ 1 N of this data term we multiply for existing C _1 
Eq. ( |692| ) by P _1 C _1 , where P ^ at data points, to obtain 

Q-ip = p-i^ ( 693 ) 

with data dependent 

C (X, y; x , y ) = -— -. 694 

Ei C(a;, y; x h yt) 

Thus, Eq. ( p92| ) is the stationarity equation of the functional 

E P = -(N,\nP) + hp,C- 1 P). (695) 

To study the dependence on the number n of training data for a given C 
consider a normalized kernel with JdyC(x,y;x',y') = A, Wx,x',y'. For such 
a kernel the denominator of C is equal to nX so we have 

C = 4, P = ^ (696) 

Assuming that for large n the empirical average (1/n) J2i C(x, y; Xi, yi) in the 
denominator of C -1 becomes n independent, e.g., converging to the true av- 
erage n Jdx'dy 1 p(x', y')C(x, y; x', y'), the regularizing term in functional (|695|) 
becomes proportional to n 

C- 1 oc n\ 2 , (697) 

According to Eq. ( f76|) this would allow to relate a saddle point approximation 
to a large n-limit. 

Again, a similar possibility is to start with the empirical density P em p 
defined in Eq. (|237|) . Analogously to Eq. (|680|) , the empirical density can for 
example also be smoothed and correctly normalized again, so that 

= CP cmp . (698) 
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with C defined in Eq. 

Fig. [13] compares the initialization according to Eq. ( |691| ), where the 

smoothing operator C acts on N, with an initialization according to Eq. 

( 698|) , where the smoothing operator C acts on the correctly normalized 
p 

1 emp- 



7.4 Numerical examples 

7.4.1 Density estimation with Gaussian specific prior 

In this section we look at some numerical examples and discuss implemen- 
tations of the nonparametric learning algorithms for density estimation we 
have discussed in this paper. 

As example, consider a problem with a one-dimensional X-space and a 
one-dimensional F-space, and a smoothness prior with inverse covariance 

K = X x (K x ® ly) + X y (l x ® Ky) , (699) 

where 

K x = A Ix - X 2 A X + A 4 A 2 - A 6 A^ (700) 

Ky = A Iy - A 2 A, + A 4 Aj - A 6 Aj, (701) 

and Laplacian 

A x (x, x') = 8"(x - x') = 8(x - x')^-, (702) 

and analogously for A y . For A 2 ^ = A = A 4 = A 6 this corresponds to the 
two Laplacian prior factors A x for x and A y for y. (Notice that also for A^ = 
X y the A 4 - and A 6 -terms do not include all terms of an iterated 2-dimensional 
Laplacian, like A 2 = (A^ + A y ) 2 or A 4 , as the mixed derivatives A^A^ are 
missing.) 

We will now study nonparametric density estimation with prior factors 
being Gaussian with respect to L as well as being Gaussian with respect to 
P. 

The error or energy functional for a Gaussian prior factor in L is given 
by Eq. (|108|) . The corresponding iteration procedure is 

L {i+1) = L (i) + vA -l (ft _ KL (i) _ e L 



N 



X 



IvKL w 



(703) 
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Written explicitly for A 2 = 1, A = A 4 = A 6 = Eq. ( |703| ) reads, 

L^ i+1 \x, y) = L®(x, y)+vT / A~ l (x, y; x 3 , Vj ) (704) 



+ V Jdx'dy'A-\x,y;x',y') ^L®(x' ,y') + -£^LV(x> ,y>) 
^5(x>- Xj ) + fd y »-^LV(x',y")+ jd^^L^A^^ 



Here 



VB 



VA 



vanishes if the first derivative -^L^(x,y) vanishes at the boundary or if 
periodic. 

Analogously, for error functional Ep ( |163| ) the iteration procedure 



P (i+i) = P (i) + vA -i [(pW)-ijv — Nx — I x P (i) KP w - KP (i) 
becomes for A2 = 1, Ao = A4 = A6 = 

P^)(,,,) = P«(a:,,) + ,E A p^^f ) 



(705) 



(706) 



+77 dx'dy'A (x,y;x',y' 



, „^P^(x',y") 



J2s(x'-x j )+jd y "p^(x', y " J d{x , ? 

+ Jdy"P^(x',y" 



d( y »y 



Here 



VA V {X,V) %") 2 



1 ' yj %") 



'•'"; // A//" : (.'''.//'! N 



tJA 



tJA 



dy" 



(707) 
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where the first term vanishes for pW periodic or vanishing at the boundaries. 
(This has to be the case for id/dy to be hermitian.) 

We now study density estimation problems numerically. In particular, 
we want to check the influence of the nonlinear normalization constraint. 
Furthermore, we want to compare models with Gaussian prior factors for L 
with models with Gaussian prior factors for P. 

The following numerical calculations have been performed on a mesh of 
dimension lOx 15, i.e., x G [1,10] and y G [1,15], with periodic boundary 
conditions on y and sometimes also in x. A variety of different iteration and 
initialization methods have been used. 



Figs. 14 - 17 summarize results for density estimation problems with only 



two data points, where differences in the effects of varying smoothness priors 
are particularly easy to see. A density estimation with more data points can 
be found in Fig. |2T| . 

For Fig. |14]a Laplacian smoothness prior on L has been implemented. The 
solution has been obtained by iterating with the negative Hessian, as long 
as positive definite. Otherwise the gradient algorithm has been used. One 
iteration step means one iteration according to Eq. ( |611| ) with the optimal 
rj. Thus, each iteration step includes the optimization of rj by a line search 
algorithm. (For the figures the Mathematica function FindMinimum has 
been used to optimize rj.) 

As initial guess in Fig. [14] the kernel estimate = ln(CP emp ) has been 
employed, with C defined in Eq. ( |692D and C = (K+m^I) with squared mass 
rr? G = 0.1. The fast drop-off of the energy El within the first two iterations 
shows the quality of this initial guess. Indeed, this fast convergence seems 
to indicate that the problem is nearly linear, meaning that the influence of 
the only nonlinear term in the stationarity equation, the normalization con- 
straint, is not too strong. Notice also, that the reconstructed regression shows 
the typical piecewise linear approximations well known from one-dimensional 
(normalization constraint free) regression problems with Laplacian prior. 

Fig. [15| shows a density estimation similar to Fig. [TJ], but for a Gaussian 
prior factor in P and thus also with different A2, different initialization, and 
slightly different iteration procedure. For Fig. 15 also a kernel estimate P^ 
= (CP emp ) has been used as initial guess, again with C as defined in Eq. 



and C = (K + m^I) but with squared mass mr c = 1.0. The solution has been 
obtained by prior relaxation A = K + m 2 l including a mass term with m 2 
= 1.0 to get for a Laplacian K = —A and periodic boundary conditions an 
invertible A. This iteration scheme does not require to calculate the Hessian 
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Hp at each iteration step. Again the quality of the initial guess (and the 
iteration scheme) is indicated by the fast drop-off of the energy Ep during 
the first iteration. 

Because the range of P-values, being between zero and one, is smaller 
than that of L-values, being between minus infinity and zero, a larger Lapla- 
cian smoothness factor A2 is needed for Fig. 15 to get similar results than for 
Fig. [14]. In particular, such A2 values have been chosen for the two figures 
that the maximal values of the the two reconstructed probability densities P 
turns out to be nearly equal. 

Because the logarithm particularly expands the distances between small 
probabilities one would expect a Gaussian prior for L to be especially effective 
for small probabilities. Comparing Fig. ^ and Fig. [T5] this effect can indeed 
be seen. The deep valleys appearing in the L-landscape of Fig. [15] show that 
small values of L are not smoothed out as effectively as in Fig. [H] Notice, 
that therefore also the variance of the solution p(y\x, h) is much smaller for 
a Gaussian prior in P at those x which are in the training set. 

Fig. [16] resumes results for a model similar to that presented in Fig. 
|14| , but with a (— A 3 )-prior replacing the Laplacian (— A)-prior. As all 
quadratic functions have zero third derivative such a prior favors, applied to 
L, quadratic log-likelihoods, corresponding to Gaussian probabilities P. In- 
deed, this is indicated by the striking difference between the regression func- 
tions in Fig. [16] and in Fig. [14]: The (— A 3 )-prior produces a much rounder 
regression function, especially at the x values which appear in the data. Note 
however, that in contrast to a pure Gaussian regression problem, in density 
estimation an additional non-quadratic normalization constraint is present. 

In Fig. |D] a similar prior has been applied, but this time being Gaussian 
in P instead of L. In contrast to a (— A 3 )-prior for L, a (— A 3 )-prior for P 
implements a tendency to quadratic P. Similarly to the difference between 
Fig. [14] and Fig. [16|, the regression function in Fig. [17] is also rounder than 
that in Fig. [15|. Furthermore, smoothing in Fig. [T7| is also less effective for 
smaller probabilities than it is in Fig. [H]. That is the same result we have 
found comparing the two priors for L shown in Fig. [T5| and Fig. [TJ]. This leads 
to deeper valleys in the L-landscape and to a smaller variance especially at 
x which appear in the training data. 

Fig. |21| depicts the results of a density estimation based on more than 
two data points. In particular, fifty training data have been obtained by 
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sampling with uniform p(x) from the "true" density 



P tme {x,y) - /^(/y|-r. /;, rili ) = 1_ | e ^ 4" +e J. (70S) 

/V 2 71" (To 

with (T = 1.5, h a (:r) = 125/18 + (5/9)z, /i 6 (z) = 145/18 - (5/9)x, shown in 
the top row of Fig. |18[ The sampling process has been implemented using the 
transformation method (see for example |[L91]1 ). The corresponding empirical 
density N/n ( |234| ) and conditional empirical density P cmp of Eq. ( |235| ), in 
this case equal to the extended -P emp defined in Eq. (|237|) , can be found in 
Fig. 



Fig. [H] shows the maximum posterior solution p(y\x,h*) and its loga- 
rithm, the energy El during iteration, the regression function 

h(x) = Jdyyp(y\x,h truc ) = j dy y P tiue (x,y), (709) 

(as reference, the regression function for the true likelihood p(y\x, h trne ) is 
given in Fig. ^), the average training error (or empirical (conditional) log- 
loss) 

1 - 

< ~ fop(y\x, h) > D = \°Zv{Vi\xh h), (710) 

n 

i=i 

and the average test error (or true expectation of (conditional) log-loss) for 
uniform p(x) 

< -\np(y\x, h) > Ptruo = - Jdydxp(x)p(y\x, h tvuc )\np(y\x, h), (711) 

which is, up to a constant, equal to the expected Kullback-Leibler distance 
between the actual solution and the true likelihood, 

KL(p(x,y\h trnc ),p(y\x, h)) = - [ dy dxp(x, y\h tIue ) In 7^—7— v ( 712 ) 

The test error measures the quality of the achieved solution. It has, in 
contrast to the energy and training error, of course not been available to the 
learning algorithm. 

The maximum posterior solution of Fig. [21] has been calculated by mini- 
mizing El using massive prior iteration with A = K + m 2 I, a squared mass 
m 2 = 0.01, and a (conditionally) normalized, constant L^ ' as initial guess. 
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Convergence has been fast, the regression function is similar to the true one 



(see Fig. |Tg). 

Fig. ^ compares some iteration procedures and initialization methods 
Clearly, all methods do what they should do, they decrease the energy func- 
tional. Iterating with the negative Hessian yields the fastest convergence. 
Massive prior iteration is nearly as fast, even for uniform initialization, and 
does not require calculation of the Hessian at each iteration. Finally, the 
slowest iteration method, but the easiest to implement, is the gradient algo- 
rithm. 



Looking at Fig. [22| one can distinguish data-oriented from prior-oriented 
initializations. We understand data-oriented initial guesses to be those for 
which the training error is smaller at the beginning of the iteration than for 
the final solution and prior-oriented initial guesses to be those for which the 
opposite is true. For good initial guesses the difference is small. Clearly, 
the uniform initializations is prior-oriented, while an empirical log-density 
ln(N/n + e) and the shown kernel initializations are data-oriented. 

The case where the test error grows while the energy is decreasing indi- 
cates a misspecified prior and is typical for overfitting. For example, in the 
fifth row of Fig. |22| the test error (and in this case also the average train- 
ing error) grows again after having reached a minimum while the energy is 
steadily decreasing. 

7.4.2 Density estimation with Gaussian mixture prior 

Having seen Bayesian field theoretical models working for Gaussian prior 
factors we will study in this section the slightly more complex prior mixture 
models. Prior mixture models are an especially useful tool for implementing 
complex and unsharp prior knowledge. They may be used, for example, 
to translate verbal statements of experts into quantitative prior densities 
131 , 132 , 133 , 134 , 135 |, similar to the quantification of "linguistic variables" 



by fuzzy methods |TT8|, [TTJ. 



We will now study a prior mixture with Gaussian prior components in L. 
Hence, consider the following energy functional with mixture prior 

E L = -ln$> ie -^ = -(L,N) + (e L ,A x ) - ln^e"^ (713) 

3 3 

with mixture components 

E s = -(L, N) + XE 0J + (e L ,A x ). (714) 



163 



We choose Gaussian component prior factors with equal covariances but dif- 
fering means 

^■ = i(L-t,-,K(L -tj)). (715) 



Hence, the stationarity equation for Functional ( |713|) becomes 



= iV - AK - a jtj ] - eLA x, 
with Lagrange multiplier function 



(716) 



A; 



N 



x 



XI X K \L-J2ajtj 



(717) 



and mixture coefficients 



Pj e 



-\En 



(718) 



The parameter A plays here a similar role as the inverse temperature (3 for 
prior mixtures in regression (see Sect. |6.3|) . In contrast to the /^-parameter 
in regression, however, the "low temperature" solutions for A — > oo are the 
pure prior templates tj, and for A — > the prior factor is switched off. 

Typical numerical results of a prior mixture model with two mixture 
components are presented in Figs. |23|-|28]. Like for Fig. [21], the true likelihood 
used for these calculations is given by Eq. ([7UH[) and shown in Fig. [TBI. The 



corresponding true regression function is thus that of Fig. [19|. Also, the same 
training data have been used as for the model of Fig. |2l| (Fig. |20|). The 
two templates t\ and ti which have been selected for the two prior mixture 



components are (Fig. [18|) 



h{x,y) 
t 2 (x,y) 



1 



(v— Ija) 



■2rr 



2V27TCT t 
1 



* + e 



(V-H2V 



(719) 
(720) 



with a t = 2, n a = fx 2 + 25/9 = 10.27, [i h = fj, 2 ~ 25/9 = 4.72, and /i 2 = 
15/2. Both templates capture a bit of the structure of the true likelihood, 
but not too much, so learning remains interesting. The average test error of 
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ti is equal to 2.56 and is thus lower than that of t% being equal to 2.90. The 
minimal possible average test error 2.23 is given by that of the true solution 
P true . A uniform P, being the effective template in the zero mean case of Fig. 
has with 2.68 an average test error between the two templates t\ and ti. 



Fig. |23| proves that convergence is fast for massive prior relaxation when 
starting from t\ as initial guess L^°\ Compared to Fig. 21 the solution is a bit 
smoother, and as template ti is a better reference than the uniform likelihood 
the final test error is slightly lower than for the zero-mean Gaussian prior 
on L. Starting from = ti convergence is not much slower and the final 
solution is similar, the test error being in that particular case even lower (Fig. 
p4|) . Starting from a uniform the mixture model produces a solution very 
similar to that of Fig. |2T] (Fig. |2J). 

The effect of changing the A parameter of the prior mixture can be seen 
in Fig. |2B| and Fig. |27|. Larger A means a smoother solution and faster 
convergence when starting from a template likelihood (Fig. ^6|). Smaller A 
results in a more rugged solution combined with a slower convergence. The 
test error in Fig. [Z7| already indicates overfitting. 

Prior mixture models tend to produce metastable and approximately sta- 
ble solutions. Fig. ^8] presents an example where starting with = ti the 
learning algorithm seems to have produced a stable solution after a few it- 
erations. However, iterating long enough this decays into a solution with 
smaller distance to t\ and with lower test error. Notice that this effect can 
be prevented by starting with another initialization, like for example with 
= t\ or a similar initial guess. 

We have seen now that, and also how, learning algorithms for Bayesian 
field theoretical models can be implemented. In this paper, the discussion 
of numerical aspects was focussed on general density estimation problems. 
Other Bayesian field theoretical models, e.g., for regression and inverse quan- 
tum problems, have also been proved to be numerically feasible. Specifically, 
prior mixture models for Gaussian regression are compared with so-called 



Landau-Ginzburg models in [131|. An application of prior mixture mod- 



els to image completion, formulated as a Gaussian regression model, can be 



found in | 136 |. Furthermore, hyperparameter have been included in numer- 
ical calculations in | |132| and also in |136| . Finally, learning algorithms for 
inverse quantum problems are treated in [|142|| for inverse quantum statistics, 
and, in combination with a mean field approach, in [|141|| for inverse quantum 
many-body theory. Time-dependent inverse quantum problems will be the 
topic of [[13 
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In conclusion, we may say that many different Bayesian field theoretical 
models have already been studied numerically and proved to be computation- 
ally feasible. This also shows that such nonparametric Bayesian approaches 
are relatively easy to adapt to a variety of quite different learning scenarios. 
Applications of Bayesian field theory requiring further studies include, for 
example, the prediction of time-series and the interactive implementation of 
unsharp a priori information. 
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Figure 13: Comparison of initial guesses P(°\x,y) for with two data 

points located at (3, 3) and (7, 12) within the intervals y G [1, 15] and x G 
[1, 10] with periodic boundary conditions. First row: P^ = CN. (The 
smoothing operator acts on the unnormalized N. The following conditional 
normalization changes the shape more drastically than in the example shown 
in the second row.) Second row: P^ = CP e mp. (The smoothing operator 
acts on the already conditionally normalized P emp .) The kernel C is given 
by Eq. ( ggg ) with C = (K + m 2 c I), m 2 c = 1.0, and a K of the form of Eq. 
( |699|) with A = A 4 = A 6 = 0, and A 2 = 0.1 (figures on the l.h.s.) or A 2 = 1.0 
(figures on the r.h.s.), respectively. 
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Figure 14: Density estimation with 2 data points and a Gaussian prior 
factor for the log-probability L. First row: Final P and L. Second row: 
The l.h.s. shows the energy El ( |108| ) during iteration, the r.h.s. the regres- 
sion function h(x) = Jdyyp(y\x,h trne ) = JdyyP tT ne(x,y). The dotted lines 
indicate the range of one standard deviation above and below the regression 
function (ignoring periodicity in x). The fast convergence shows that the 
problem is nearly linear. The asymmetry of the solution between the x- 
and y-direction is due to the normalization constraint, only required for y. 
(Laplacian smoothness prior K as given in Eq. ( |699| ) with X x = \ y = 1, \ 
= 0, A2 = 0.025, A4 = A6 = 0. Iteration with negative Hessian A = — H if 
positive definite, otherwise with the gradient algorithm, i.e., A = I. Initial- 
ization with = ln(CP emp ), i.e., L^ ' normalized to Jdye L = 1, with C of 
Eq. (|692|) and C = (K + rri^I), m 2 c = 0.1. Within each iteration step the 
optimal step width 77 has been found by a line search. Mesh with 10 points 
in x-direction and 15 points in y-direction, periodic boundary conditions in 
x and y. The 2 data points are (3, 3) and (7, 12).) 
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Figure 15: Density estimation with 2 data points, this time with a Gaussian 
prior factor for the probability P, minimizing the energy functional Ep ( |163| ). 
To make the figure comparable with Fig. [14] the parameters have been chosen 
so that the maximum of the solution P is the same in both figures (max P = 

0. 6). Notice, that compared to Fig. [14| the smoothness prior is less effective 
for small probabilities. (Same data, mesh and periodic boundary conditions 
as for Fig. [14|. Laplacian smoothness prior K as in Eq. ( p99| ) with \ x = \ y 
— 1, Ao = 0, A2 = 1, A4 = Ae = 0. Iterated using massive prior relaxation, 

1. e., A = K + m 2 I with m 2 = 1.0. Initialization with P^ = CP emp , with 
C of Eq. ( p92| ) so P^ is correctly normalized, and C = (K + m^T), m 2 c = 
1.0. Within each iteration step the optimal factor r] has been found by a line 
search algorithm.) 
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Figure 16: Density estimation with a (—A 3 ) Gaussian prior factor for 
the log-probability L. Such a prior favors probabilities of Gaussian shape. 
(Smoothness prior K of the form of Eq. ( |699[) with \ x = \ y = I, \$ = 0, \ 2 
= 0, A 4 = 0, A 6 = 0.01. Same iteration procedure, initialization, data, mesh 
and periodic boundary conditions as for Fig. [14].) 
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Figure 17: Density estimation with a (—A 3 ) Gaussian prior factor for the 
probability P. As the variation of P is smaller than that of L, a smaller A@ 
has been chosen than in Fig. |T7[ The Gaussian prior in P is also relatively 
less effective for small probabilities than a comparable Gaussian prior in L. 
(Smoothness prior K of the form of Eq. ( |699[) with X x = \ y = 1, A = 0, A 2 
= 0, A 4 = 0, A 6 = 0.1. Same iteration procedure, initialization, data, mesh 
and periodic boundary conditions as for Fig. |T5| .) 
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15 15 



Figure 18: First row: True density P true (l.h.s.) true log-density L true = 
logPtme (r.h.s.) used for Figs. |2l|-p8|. Second and third row: The two 
templates t\ and t 2 of Figs. ^-^8] for P (tf, l.h.s.) or for L (tf, r.h.s.), 
respectively, with tf = logtf. As reference for the following figures we 
give the expected test error / dy dx p(x)p(y\x, h true ) ha. p(y\x, h) under the true 
p(y\x, /itrue) for uniform p(x). It is for h true equal to 2.23 for template t\ equal 
to 2.56, for template t 2 equal 2.90 and for a uniform P equal to 2.68. 
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Regression function 
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Figure 19: Regression function ht Tne {x) for the true density P t rue of Fig. 



181, defined as h(x) = Jdyyp(y\x,h truc ) = JdyyP tIUC (x,y). The dashed lines 



indicate the range of one standard deviation above and below the regression 
function. 



Empirical density Conditional empirical density 




Figure 20: L.h.s.: Empirical density N(x, y)/n — J2i $( x ~ x i)5(y — yi)/ Yk 1- 
sampled from p(x, y\h true ) = p(y\x, h truc )p(x) with uniform p(x). R.h.s.: 
Corresponding conditional empirical density P e mp(x,y) = (N^- 1 A^)(x, y) = 
J2i 8(x — Xi) J2i 3(y — Hi) J2i I J2i $(x — Xi). Both densities are obtained from 
the 50 data points used for Figs. [2"T|-|2"5|. 
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Figure 21: Density estimation with Gaussian prior factor for log-probability 
L with 50 data points shown in Fig. Top row: Final solution 



P(x,y) = p(y\x,h) and L = log P. Second row: Energy E L (|108|) dur- 
ing iteration and final regression function. Bottom row: Average train- 
ing error — (1/n) Y%=i ^°SP{yi\ x h h) during iteration and average test error 
— jdy dx p(x)p(y\x, h true ) \np(y\x, h) for uniform p(x). (Parameters: Zero 
mean Gaussian smoothness prior with inverse covariance AK, A = 0.5 and K 
of the form ( |699| ) with X x = 2, A^ = 1, A = 0, A 2 = 1, A 4 = A 6 = 0, massive 
prior iteration with A = K + m 2 I and squared mass m 2 = 0.01. Initialized 
with normalized constant L. At each iteration step the factor i] has been 
adapted by a line search algorithm. Mesh with 10 points in x-direction and 
15 points in ^-direction, periodic boundary conditions in y.) 
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Figure 22: Comparison of iteration schemes and initialization. First row: 
Massive prior iteration (with A = K + m 2 I, m 2 = 0.01) and uniform initial- 
ization. Second row: Hessian iteration (A = — H) and uniform initialization. 
Third row: Hessian iteration and kernel initialization (with C = K + tuqI, 
tUq = 0.01 and normalized afterwards). Forth row: Gradient (A = I) with 
uniform initialization. Fifth row: Gradient with kernel initialization. Sixth 
row: Gradient with delta-peak initialization. (Initial L equal to ln(iV/n + e), 
e = 10~ 10 , conditionally normalized. For N/n see Fig. ^0|). Minimal num- 
ber of iterations 4, maximal number of iterations 50, iteration stopped if 
|£W _ < 10 -8 . Energy functional and parameters as for Fig. ^T|. 
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Figure 23: Density estimation with a Gaussian mixture prior for log- 
probability L with 50 data points, Laplacian prior and the two template func- 
tions shown in Fig. [18]. Top row: Final solution P(x, y) = p{y\x, h) and L = 
log P. Second row: Energy Energy Ei (|713|) during iteration and final regres- 
sion function. Bottom row: Average training error -(1/n) Y%=i ^°SP(Vi\ x i: h) 
during iteration and average test error — fdy dxp(x)p(y\x, h tTne ) \np(y\x, h) 
for uniform p(x). (Two mixture components with A = 0.5 and smoothness 
prior with Ki = K 2 of the form (|699|) with X x = 2, X y = 1, Ao = 0, A2 
= 1, A4 = Ag = 0, massive prior iteration with A = K + m 2 I and squared 
mass m 2 = 0.01, initialized with L = t\. At each iteration step the factor 
7] has been adapted by a line search algorithm. Mesh with l x = 10 points 
in x-direction and l y = 15 points in y-direction, n = 2 data points at (3, 3), 
(7,12), periodic boundary conditions in y. Except for the inclusion of two 
mixture components parameters are equal to those for Fig. 
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Figure 24: Using a different starting point. (Same parameters as for Fig. 



23, but initialized with L = t2-) While the initial guess is worse then that of 



Fig. E31 the final solution is even slightly better. 
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Figure 25: Starting from a uniform initial guess. (Same as Fig. 23, but 
initialized with uniform L.) The resulting solution is, compared to Figs. ^3 
and |24], a bit more wiggly, i.e., more data oriented. One recognizes a slight 
"overfitting" , meaning that the test error increases while the training error is 
decreasing. (Despite the increasing of the test error during iteration at this 
value of A, a better solution cannot necessarily be found by just changing 
A-value. This situation can for example occur, if the initial guess is better 
then the implemented prior.) 
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Figure 26: Large A. (Same parameters as for Fig. ^3], except for A = 1.0.) 
Due to the larger smoothness constraint the averaged training error is larger 



than in Fig. 23. The fact that also the test error is larger than in Fig. 23 



indicates that the value of A is too large. Convergence, however, is very fast. 
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Figure 27: Overfitting due to too small A. (Same parameters as for Fig. 

except for A = 0.1.) A small A allows the average training error to 
become quite small. However, the average test error grows already after two 
iterations. (Having found at some A-value during iteration an increasing test 
error, it is often but not necessarily the case that a better solution can be 
found by changing A.) 



199 



15 



15 



Regression function 




Figure 28: Example of an approximately stable solution. (Same parameters 
as for Fig. 23, except for A = 1.2, m 2 = 0.5, and initialized with L = £2-) A 
nearly stable solution is obtained after two iterations, followed by a plateau 
between iterations 2 and 6. A better solution is finally found with smaller 
distance to template £1. (The plateau gets elongated with growing mass m.) 
The figure on the l.h.s. in the bottom row shows the mixing coefficients cij of 
the components of the prior mixture model for the solution during iteration 
(d, line and 02, dashed). 
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